Data Science and Business Analytics
Project 4 - Supervised Learning Classification: INN Hotels
Jorge Ramon Vazquez Campero

1. Problem Statement and Business Context
1.1 Business Context
A significant number of hotel bookings are called off due to cancellations or no-shows. Typical reasons for cancellations include changes of plans, scheduling conflicts, and so on. Canceling is often made easier by the option to do so free of charge, or at a low cost, which benefits hotel guests but is an undesirable and potentially revenue-diminishing factor for hotels to deal with. Such losses are particularly high for last-minute cancellations.
The new technologies involving online booking channels have dramatically changed customers’ booking possibilities and behavior. This adds a further dimension to the challenge of how hotels handle cancellations, which are no longer limited to traditional booking and guest characteristics.
The cancellation of bookings impacts a hotel on various fronts:
- Loss of resources (revenue) when the hotel cannot resell the room.
- Additional distribution costs, such as higher commissions or paying for publicity to help resell these rooms.
- Last-minute price reductions to resell the room, which shrink the profit margin.
- Human resources to make arrangements for the guests.
1.2 Objective
The increasing number of cancellations calls for a Machine Learning based solution that can help predict which bookings are likely to be canceled. INN Hotels Group, a chain of hotels in Portugal, is facing problems with a high number of booking cancellations and has reached out to your firm for data-driven solutions. As a data scientist, you have to analyze the provided data to find which factors have a high influence on booking cancellations, build a predictive model that can predict in advance which bookings will be canceled, and help formulate profitable policies for cancellations and refunds.
1.3 Data Description
The data contains the different attributes of customers' booking details. The detailed data dictionary is given below.
Data Dictionary
- Booking_ID: Unique identifier of each booking
- no_of_adults: Number of adults
- no_of_children: Number of children
- no_of_weekend_nights: Number of weekend nights (Saturday or Sunday) the guest stayed or booked to stay at the hotel
- no_of_week_nights: Number of week nights (Monday to Friday) the guest stayed or booked to stay at the hotel
- type_of_meal_plan: Type of meal plan booked by the customer:
  - Not Selected – No meal plan selected
  - Meal Plan 1 – Breakfast
  - Meal Plan 2 – Half board (breakfast and one other meal)
  - Meal Plan 3 – Full board (breakfast, lunch, and dinner)
- required_car_parking_space: Does the customer require a car parking space? (0 - No, 1 - Yes)
- room_type_reserved: Type of room reserved by the customer; the values are ciphered (encoded) by INN Hotels
- lead_time: Number of days between the date of booking and the arrival date
- arrival_year: Year of arrival date
- arrival_month: Month of arrival date
- arrival_date: Date of the month
- market_segment_type: Market segment designation
- repeated_guest: Is the customer a repeated guest? (0 - No, 1 - Yes)
- no_of_previous_cancellations: Number of previous bookings that were canceled by the customer prior to the current booking
- no_of_previous_bookings_not_canceled: Number of previous bookings not canceled by the customer prior to the current booking
- avg_price_per_room: Average price per day of the reservation; room prices are dynamic (in euros)
- no_of_special_requests: Total number of special requests made by the customer (e.g. high floor, view from the room, etc.)
- booking_status: Flag indicating if the booking was canceled or not
Importing necessary libraries and data
# Installing the libraries with the specified version.
!pip install pandas==1.5.3 numpy==1.25.2 matplotlib==3.7.1 seaborn==0.13.1 scikit-learn==1.2.2 statsmodels==0.14.1 -q --user
Note: After running the above cell, kindly restart the notebook kernel and run all cells sequentially from the start again.
# Import libraries for data manipulation
import numpy as np
import pandas as pd
# Import libraries for data visualization
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
sns.set()
# Suppress warnings to prevent them from being displayed during code execution
import warnings
warnings.filterwarnings('ignore')
# Enable the inline plotting of matplotlib figures directly within the notebook
%matplotlib inline
# Removes the limit for the number of displayed columns
pd.set_option("display.max_columns", None)
# Sets the limit for the number of displayed rows
pd.set_option("display.max_rows", 200)
# setting the precision of floating numbers to 5 decimal points
pd.set_option("display.float_format", lambda x: "%.5f" % x)
# Library to split data
from sklearn.model_selection import train_test_split
# To build model for prediction
import statsmodels.stats.api as sms
from statsmodels.stats.outliers_influence import variance_inflation_factor
import statsmodels.api as sm
from statsmodels.tools.tools import add_constant
from sklearn.tree import DecisionTreeClassifier
from sklearn import tree
# To tune different models
from sklearn.model_selection import GridSearchCV
# To get different metric scores
from sklearn.metrics import (
f1_score,
accuracy_score,
recall_score,
precision_score,
confusion_matrix,
roc_auc_score,
precision_recall_curve,
roc_curve,
make_scorer,
)
from statsmodels.tools.sm_exceptions import ConvergenceWarning
warnings.simplefilter("ignore", ConvergenceWarning)
# Load the dataset
data = pd.read_csv("INNHotelsGroup.csv")
# Copying data to another variable to avoid any changes to original data
df = data.copy()
2. Data Overview
- Observations
- Sanity checks
# Display the first few rows of the dataset
display(df.head())
# Display the last few rows of the dataset
display(df.tail())
| Booking_ID | no_of_adults | no_of_children | no_of_weekend_nights | no_of_week_nights | type_of_meal_plan | required_car_parking_space | room_type_reserved | lead_time | arrival_year | arrival_month | arrival_date | market_segment_type | repeated_guest | no_of_previous_cancellations | no_of_previous_bookings_not_canceled | avg_price_per_room | no_of_special_requests | booking_status | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | INN00001 | 2 | 0 | 1 | 2 | Meal Plan 1 | 0 | Room_Type 1 | 224 | 2017 | 10 | 2 | Offline | 0 | 0 | 0 | 65.00000 | 0 | Not_Canceled |
| 1 | INN00002 | 2 | 0 | 2 | 3 | Not Selected | 0 | Room_Type 1 | 5 | 2018 | 11 | 6 | Online | 0 | 0 | 0 | 106.68000 | 1 | Not_Canceled |
| 2 | INN00003 | 1 | 0 | 2 | 1 | Meal Plan 1 | 0 | Room_Type 1 | 1 | 2018 | 2 | 28 | Online | 0 | 0 | 0 | 60.00000 | 0 | Canceled |
| 3 | INN00004 | 2 | 0 | 0 | 2 | Meal Plan 1 | 0 | Room_Type 1 | 211 | 2018 | 5 | 20 | Online | 0 | 0 | 0 | 100.00000 | 0 | Canceled |
| 4 | INN00005 | 2 | 0 | 1 | 1 | Not Selected | 0 | Room_Type 1 | 48 | 2018 | 4 | 11 | Online | 0 | 0 | 0 | 94.50000 | 0 | Canceled |
| Booking_ID | no_of_adults | no_of_children | no_of_weekend_nights | no_of_week_nights | type_of_meal_plan | required_car_parking_space | room_type_reserved | lead_time | arrival_year | arrival_month | arrival_date | market_segment_type | repeated_guest | no_of_previous_cancellations | no_of_previous_bookings_not_canceled | avg_price_per_room | no_of_special_requests | booking_status | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 36270 | INN36271 | 3 | 0 | 2 | 6 | Meal Plan 1 | 0 | Room_Type 4 | 85 | 2018 | 8 | 3 | Online | 0 | 0 | 0 | 167.80000 | 1 | Not_Canceled |
| 36271 | INN36272 | 2 | 0 | 1 | 3 | Meal Plan 1 | 0 | Room_Type 1 | 228 | 2018 | 10 | 17 | Online | 0 | 0 | 0 | 90.95000 | 2 | Canceled |
| 36272 | INN36273 | 2 | 0 | 2 | 6 | Meal Plan 1 | 0 | Room_Type 1 | 148 | 2018 | 7 | 1 | Online | 0 | 0 | 0 | 98.39000 | 2 | Not_Canceled |
| 36273 | INN36274 | 2 | 0 | 0 | 3 | Not Selected | 0 | Room_Type 1 | 63 | 2018 | 4 | 21 | Online | 0 | 0 | 0 | 94.50000 | 0 | Canceled |
| 36274 | INN36275 | 2 | 0 | 1 | 2 | Meal Plan 1 | 0 | Room_Type 1 | 207 | 2018 | 12 | 30 | Offline | 0 | 0 | 0 | 161.67000 | 0 | Not_Canceled |
# Display the shape of the dataset
df.shape
print("There are", df.shape[0], "rows and", df.shape[1], "columns.\n\n")
# Display the data types of the columns in the dataset
display(df.info())
There are 36275 rows and 19 columns.

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 36275 entries, 0 to 36274
Data columns (total 19 columns):
 #   Column                                Non-Null Count  Dtype
---  ------                                --------------  -----
 0   Booking_ID                            36275 non-null  object
 1   no_of_adults                          36275 non-null  int64
 2   no_of_children                        36275 non-null  int64
 3   no_of_weekend_nights                  36275 non-null  int64
 4   no_of_week_nights                     36275 non-null  int64
 5   type_of_meal_plan                     36275 non-null  object
 6   required_car_parking_space            36275 non-null  int64
 7   room_type_reserved                    36275 non-null  object
 8   lead_time                             36275 non-null  int64
 9   arrival_year                          36275 non-null  int64
 10  arrival_month                         36275 non-null  int64
 11  arrival_date                          36275 non-null  int64
 12  market_segment_type                   36275 non-null  object
 13  repeated_guest                        36275 non-null  int64
 14  no_of_previous_cancellations          36275 non-null  int64
 15  no_of_previous_bookings_not_canceled  36275 non-null  int64
 16  avg_price_per_room                    36275 non-null  float64
 17  no_of_special_requests                36275 non-null  int64
 18  booking_status                        36275 non-null  object
dtypes: float64(1), int64(13), object(5)
memory usage: 5.3+ MB
None
Observations:
- There are 5 categorical features: Booking_ID, type_of_meal_plan, room_type_reserved, market_segment_type, and booking_status.
- The remaining 13 features are numerical.
- We will drop the Booking_ID column, since a unique identifier carries no predictive information.
# Dropping Booking ID column
df.drop("Booking_ID", axis=1, inplace=True)
# Display the statistical summary for the numerical dataset
df.describe().T
| count | mean | std | min | 25% | 50% | 75% | max | |
|---|---|---|---|---|---|---|---|---|
| no_of_adults | 36275.00000 | 1.84496 | 0.51871 | 0.00000 | 2.00000 | 2.00000 | 2.00000 | 4.00000 |
| no_of_children | 36275.00000 | 0.10528 | 0.40265 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 10.00000 |
| no_of_weekend_nights | 36275.00000 | 0.81072 | 0.87064 | 0.00000 | 0.00000 | 1.00000 | 2.00000 | 7.00000 |
| no_of_week_nights | 36275.00000 | 2.20430 | 1.41090 | 0.00000 | 1.00000 | 2.00000 | 3.00000 | 17.00000 |
| required_car_parking_space | 36275.00000 | 0.03099 | 0.17328 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 1.00000 |
| lead_time | 36275.00000 | 85.23256 | 85.93082 | 0.00000 | 17.00000 | 57.00000 | 126.00000 | 443.00000 |
| arrival_year | 36275.00000 | 2017.82043 | 0.38384 | 2017.00000 | 2018.00000 | 2018.00000 | 2018.00000 | 2018.00000 |
| arrival_month | 36275.00000 | 7.42365 | 3.06989 | 1.00000 | 5.00000 | 8.00000 | 10.00000 | 12.00000 |
| arrival_date | 36275.00000 | 15.59700 | 8.74045 | 1.00000 | 8.00000 | 16.00000 | 23.00000 | 31.00000 |
| repeated_guest | 36275.00000 | 0.02564 | 0.15805 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 1.00000 |
| no_of_previous_cancellations | 36275.00000 | 0.02335 | 0.36833 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 13.00000 |
| no_of_previous_bookings_not_canceled | 36275.00000 | 0.15341 | 1.75417 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 58.00000 |
| avg_price_per_room | 36275.00000 | 103.42354 | 35.08942 | 0.00000 | 80.30000 | 99.45000 | 120.00000 | 540.00000 |
| no_of_special_requests | 36275.00000 | 0.61966 | 0.78624 | 0.00000 | 0.00000 | 0.00000 | 1.00000 | 5.00000 |
# Display the statistical summary for the categorical dataset
df.describe(include="object").T
| count | unique | top | freq | |
|---|---|---|---|---|
| type_of_meal_plan | 36275 | 4 | Meal Plan 1 | 27835 |
| room_type_reserved | 36275 | 7 | Room_Type 1 | 28130 |
| market_segment_type | 36275 | 5 | Online | 23214 |
| booking_status | 36275 | 2 | Not_Canceled | 24390 |
# Making a list of all categorical variables ('object' or 'category')
cat_cols = df.select_dtypes(include=["object", "category"]).columns
# Iterate through each categorical column and print the count of unique categorical levels, followed by a separator line.
for column in cat_cols:
print(df[column].value_counts())
print("-" * 50)
type_of_meal_plan
Meal Plan 1     27835
Not Selected     5130
Meal Plan 2      3305
Meal Plan 3         5
Name: count, dtype: int64
--------------------------------------------------
room_type_reserved
Room_Type 1    28130
Room_Type 4     6057
Room_Type 6      966
Room_Type 2      692
Room_Type 5      265
Room_Type 7      158
Room_Type 3        7
Name: count, dtype: int64
--------------------------------------------------
market_segment_type
Online           23214
Offline          10528
Corporate         2017
Complementary      391
Aviation           125
Name: count, dtype: int64
--------------------------------------------------
booking_status
Not_Canceled    24390
Canceled        11885
Name: count, dtype: int64
--------------------------------------------------
# Making a list of all numerical variables ('int64', 'float64', 'complex')
num_cols = df.select_dtypes(include=["int64", "float64", "complex"]).columns
# Iterate through each numerical column and print summary statistics, followed by a separator line.
for column in num_cols:
print(df[column].describe())
print("-" * 50)
no_of_adults:                          count 36275, mean 1.84496, std 0.51871, min 0.00000, 25% 2.00000, 50% 2.00000, 75% 2.00000, max 4.00000
no_of_children:                        count 36275, mean 0.10528, std 0.40265, min 0.00000, 25% 0.00000, 50% 0.00000, 75% 0.00000, max 10.00000
no_of_weekend_nights:                  count 36275, mean 0.81072, std 0.87064, min 0.00000, 25% 0.00000, 50% 1.00000, 75% 2.00000, max 7.00000
no_of_week_nights:                     count 36275, mean 2.20430, std 1.41090, min 0.00000, 25% 1.00000, 50% 2.00000, 75% 3.00000, max 17.00000
required_car_parking_space:            count 36275, mean 0.03099, std 0.17328, min 0.00000, 25% 0.00000, 50% 0.00000, 75% 0.00000, max 1.00000
lead_time:                             count 36275, mean 85.23256, std 85.93082, min 0.00000, 25% 17.00000, 50% 57.00000, 75% 126.00000, max 443.00000
arrival_year:                          count 36275, mean 2017.82043, std 0.38384, min 2017.00000, 25% 2018.00000, 50% 2018.00000, 75% 2018.00000, max 2018.00000
arrival_month:                         count 36275, mean 7.42365, std 3.06989, min 1.00000, 25% 5.00000, 50% 8.00000, 75% 10.00000, max 12.00000
arrival_date:                          count 36275, mean 15.59700, std 8.74045, min 1.00000, 25% 8.00000, 50% 16.00000, 75% 23.00000, max 31.00000
repeated_guest:                        count 36275, mean 0.02564, std 0.15805, min 0.00000, 25% 0.00000, 50% 0.00000, 75% 0.00000, max 1.00000
no_of_previous_cancellations:          count 36275, mean 0.02335, std 0.36833, min 0.00000, 25% 0.00000, 50% 0.00000, 75% 0.00000, max 13.00000
no_of_previous_bookings_not_canceled:  count 36275, mean 0.15341, std 1.75417, min 0.00000, 25% 0.00000, 50% 0.00000, 75% 0.00000, max 58.00000
avg_price_per_room:                    count 36275, mean 103.42354, std 35.08942, min 0.00000, 25% 80.30000, 50% 99.45000, 75% 120.00000, max 540.00000
no_of_special_requests:                count 36275, mean 0.61966, std 0.78624, min 0.00000, 25% 0.00000, 50% 0.00000, 75% 1.00000, max 5.00000
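Several columns above show a long right tail (e.g. lead_time has a mean of 85.2 but a median of 57, with a maximum of 443). One quick way to quantify such asymmetry is the sample skewness; a minimal sketch on toy data (the series below is illustrative, not drawn from the dataset):

```python
import pandas as pd

# Toy right-skewed values standing in for a column like lead_time (illustrative only)
s = pd.Series([0, 5, 10, 20, 200])

# A positive skew confirms a long right tail; df[num_cols].skew() would do this per column
print(s.skew())
```

On the real data, columns with skew well above zero are the natural candidates to inspect for outliers later.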
3. Exploratory Data Analysis (EDA)
- EDA is an important part of any project involving data.
- It is important to investigate and understand the data better before building a model with it.
- A few questions have been mentioned below which will help you approach the analysis in the right manner and generate insights from the data.
Note: We will mainly focus on the leading questions next. The detailed EDA can be found in the Appendix.
Leading Questions:
- What are the busiest months in the hotel?
- Which market segment do most of the guests come from?
- Hotel rates are dynamic and change according to demand and customer demographics. What are the differences in room prices in different market segments?
- What percentage of bookings are canceled?
- Repeating guests are the guests who stay in the hotel often and are important to brand equity. What percentage of repeating guests cancel?
- Many guests have special requirements when booking a hotel room. Do these requirements affect booking cancellation?
The functions below need to be defined to carry out the EDA.
# Function to create labeled barplots
def labeled_barplot(data, feature, figsize=(20, 6), perc=False, n=None):
"""
Barplot with percentage at the top
data: dataframe
feature: dataframe column
figsize: size of figure (default (20, 6))
perc: whether to display percentages instead of count (default is False)
n: displays the top n category levels (default is None, i.e., display all levels)
"""
total = len(data[feature]) # length of the column
count = data[feature].nunique()
plt.figure(figsize=figsize) # set plot size
ax = sns.countplot(
data=data,
x=feature,
palette="dark",
order=data[feature].value_counts().index[:n],
)
plt.title(f"Distribution of {feature}", fontsize=16) # add title
plt.xlabel(feature, fontsize=14) # add x-axis label
plt.ylabel("Count", fontsize=14) # add y-axis label
for p in ax.patches:
if perc:
label = "{:.1f}%".format(
100 * p.get_height() / total
) # percentage of each class of the category
else:
label = p.get_height() # count of each level of the category
x = p.get_x() + p.get_width() / 2 # x-coordinate of the bar's center
y = p.get_height() # height of the bar
ax.annotate(
label,
(x, y),
ha="center",
va="center",
size=12,
xytext=(0, 5),
textcoords="offset points",
) # annotate the percentage
plt.xticks(rotation=90, fontsize=12) # rotate x-axis labels
plt.yticks(fontsize=12) # set y-axis labels size
plt.show() # show the plot
# Function to display a histogram and boxplot combined
def histogram_boxplot(data, feature, figsize=(12, 6), kde=False, bins=None):
"""
Boxplot and histogram combined
data: dataframe
feature: dataframe column
figsize: size of figure (default (12, 6))
kde: whether to show the density curve (default False)
bins: number of bins for histogram (default None)
"""
f2, (ax_box2, ax_hist2) = plt.subplots(
nrows=2, # Number of rows of the subplot grid= 2
sharex=True, # x-axis will be shared among all subplots
gridspec_kw={"height_ratios": (0.25, 0.75)},
figsize=figsize,
) # creating the 2 subplots
sns.boxplot(
data=data, x=feature, ax=ax_box2, showmeans=True, color="violet"
) # boxplot will be created and a triangle will indicate the mean value of the column
(
sns.histplot(data=data, x=feature, kde=kde, ax=ax_hist2, bins=bins)
if bins
else sns.histplot(data=data, x=feature, kde=kde, ax=ax_hist2)
) # For histogram
ax_hist2.axvline(
data[feature].mean(), color="green", linestyle="--", label="Mean"
) # Add mean to the histogram
ax_hist2.axvline(
data[feature].median(), color="black", linestyle="-", label="Median"
) # Add median to the histogram
ax_box2.set(xlabel="") # Remove x-axis label for the boxplot
ax_hist2.set_xlabel(feature, fontsize=14) # Add x-axis label for the histogram
ax_hist2.set_ylabel("Count", fontsize=14) # Add y-axis label for the histogram
ax_hist2.legend() # Add legend to the histogram
plt.title(f"Distribution of {feature}", fontsize=16) # Add title to the histogram
plt.show() # Show the plot
# Let's create a copy of the data
df_eda = df.copy()
1. What are the busiest months in the hotel?
# Count the number of bookings per month
monthly_bookings = df_eda["arrival_month"].value_counts().sort_index()
print("Distribution of 'arrival_month'")
print(df_eda["arrival_month"].describe())
labeled_barplot(df_eda, "arrival_month", perc=True)
print("-" * 100)
Distribution of 'arrival_month'
count   36275.00000
mean        7.42365
std         3.06989
min         1.00000
25%         5.00000
50%         8.00000
75%        10.00000
max        12.00000
Name: arrival_month, dtype: float64
----------------------------------------------------------------------------------------------------
Busiest Months Observations:
- The top 5 busiest months according to the data are as follows:
- October (10)
- September (9)
- August (8)
- June (6)
- December (12)
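The ranking above comes directly from value_counts, which sorts by frequency. A self-contained sketch of the same idea, using a small toy series rather than the real arrival_month column:

```python
import pandas as pd

# Toy arrival months standing in for df_eda["arrival_month"] (illustrative only)
months = pd.Series([10, 10, 10, 9, 9, 8, 6, 12, 1])

# value_counts sorts by frequency, so head(5) yields the five busiest months
top_months = months.value_counts().head(5)
```

On the real data, `df_eda["arrival_month"].value_counts().head(5)` reproduces the list above.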
2. Which market segment do most of the guests come from?
# Count the number of bookings per market segment
market_segment_counts = df_eda["market_segment_type"].value_counts()
print("Distribution of 'market_segment_type'")
print(df_eda["market_segment_type"].describe())
labeled_barplot(df_eda, "market_segment_type", perc=True)
print("Market Segment Type Counts:")
print(market_segment_counts)
print("-" * 100)
Distribution of 'market_segment_type'
count      36275
unique         5
top       Online
freq       23214
Name: market_segment_type, dtype: object
Market Segment Type Counts:
market_segment_type
Online           23214
Offline          10528
Corporate         2017
Complementary      391
Aviation           125
Name: count, dtype: int64
----------------------------------------------------------------------------------------------------
Market Segment Observations:
- Most guests come from the Online market segment, which accounts for about 64% of bookings.
- The Offline and Corporate market segments come next, with about 29% and 5.6% respectively.
3. Hotel rates are dynamic and change according to demand and customer demographics. What are the differences in room prices in different market segments?
# Analyze the differences in room prices by market segment
print("Distribution of 'avg_price_per_room' by 'market_segment_type'")
display(df_eda.groupby("market_segment_type")["avg_price_per_room"].describe())
histogram_boxplot(df_eda, "avg_price_per_room", kde=True)
plt.figure(figsize=(12, 8))
sns.boxplot(
x="market_segment_type",
y="avg_price_per_room",
data=df_eda,
palette="Set2",
showmeans=True,
)
plt.title("Room Prices by Market Segment")
plt.xlabel("Market Segment")
plt.ylabel("Average Price per Room")
plt.xticks(rotation=45)
plt.show()
print("-" * 100)
Distribution of 'avg_price_per_room' by 'market_segment_type'
| count | mean | std | min | 25% | 50% | 75% | max | |
|---|---|---|---|---|---|---|---|---|
| market_segment_type | ||||||||
| Aviation | 125.00000 | 100.70400 | 8.53836 | 79.00000 | 95.00000 | 95.00000 | 110.00000 | 110.00000 |
| Complementary | 391.00000 | 3.14176 | 15.51297 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 170.00000 |
| Corporate | 2017.00000 | 82.91174 | 23.69000 | 31.00000 | 65.00000 | 79.00000 | 95.00000 | 220.00000 |
| Offline | 10528.00000 | 91.63268 | 24.99560 | 12.00000 | 75.00000 | 90.00000 | 109.00000 | 540.00000 |
| Online | 23214.00000 | 112.25685 | 35.22032 | 0.00000 | 89.00000 | 107.10000 | 131.75000 | 375.50000 |
----------------------------------------------------------------------------------------------------
Observations of Room Prices by Market Segment:
- As the segmentation above shows, the three highest average room prices belong to the Online, Aviation, and Offline market segments, at about 112.26, 100.70, and 91.63 euros respectively.
- In other words, Online, Aviation, and Offline customers pay more for their rooms on average than any other segment.
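The per-segment averages reported above can be extracted and ranked in one step with a groupby; a minimal sketch on toy data (the values below are illustrative, not the real bookings):

```python
import pandas as pd

# Toy bookings standing in for df_eda (illustrative values only)
df_toy = pd.DataFrame({
    "market_segment_type": ["Online", "Online", "Offline", "Aviation"],
    "avg_price_per_room": [120.0, 104.0, 92.0, 101.0],
})

# Mean room price per segment, highest first
mean_prices = (
    df_toy.groupby("market_segment_type")["avg_price_per_room"]
    .mean()
    .sort_values(ascending=False)
)
```

Applied to df_eda, the same groupby/sort reproduces the segment ranking discussed above.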
4. What percentage of bookings are canceled?
# Calculate the percentage of bookings canceled
cancellation_rate = df_eda["booking_status"].value_counts(normalize=True) * 100
print("Distribution of 'booking_status'")
print(df_eda["booking_status"].describe())
labeled_barplot(df_eda, "booking_status", perc=True)
print("Percentage of Bookings Canceled:")
print(cancellation_rate)
print("-" * 100)
Distribution of 'booking_status'
count            36275
unique               2
top       Not_Canceled
freq             24390
Name: booking_status, dtype: object
Percentage of Bookings Canceled:
booking_status
Not_Canceled   67.23639
Canceled       32.76361
Name: proportion, dtype: float64
----------------------------------------------------------------------------------------------------
Observations of Bookings Cancelled:
- As we can see, the majority of bookings, 67.24% to be precise, are not canceled.
- The remaining 32.76% of bookings were canceled.
5. Repeating guests are the guests who stay in the hotel often and are important to brand equity. What percentage of repeating guests cancel?
# Filter for repeating guests
repeating_guests = df_eda[df_eda["repeated_guest"] == 1]
# Calculate the cancellation rate among repeating guests
repeating_guest_cancellation_rate = (
repeating_guests["booking_status"].value_counts(normalize=True) * 100
)
print("Distribution of 'booking_status' for Repeating Guests")
print(repeating_guests["booking_status"].describe())
labeled_barplot(repeating_guests, "booking_status", perc=True)
print("Percentage of Repeating Guests Who Canceled:")
print(repeating_guest_cancellation_rate)
print("-" * 100)
Distribution of 'booking_status' for Repeating Guests
count              930
unique               2
top       Not_Canceled
freq               914
Name: booking_status, dtype: object
Percentage of Repeating Guests Who Canceled:
booking_status
Not_Canceled   98.27957
Canceled        1.72043
Name: proportion, dtype: float64
----------------------------------------------------------------------------------------------------
Observations of Repeating Guests Who Canceled:
- Repeating guests cancel far less often than average: 98.3% of repeating-guest bookings are not canceled.
- Only the remaining 1.7% of repeating-guest bookings were canceled, versus an overall cancellation rate of about 32.8%.
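Instead of filtering for repeating guests first, both guest groups can be compared in a single step with a normalized crosstab; a sketch on toy data (illustrative values only, not the real bookings):

```python
import pandas as pd

# Toy bookings: repeated_guest 1 = repeating guest (illustrative values only)
df_toy = pd.DataFrame({
    "repeated_guest": [1, 1, 0, 0, 0, 0],
    "booking_status": ["Not_Canceled", "Not_Canceled", "Canceled",
                       "Not_Canceled", "Canceled", "Not_Canceled"],
})

# normalize="index" converts each row to within-group percentages,
# i.e. a cancellation rate per guest type
rates = pd.crosstab(df_toy["repeated_guest"],
                    df_toy["booking_status"],
                    normalize="index") * 100
```

On df_eda, the same crosstab shows the repeating vs. first-time cancellation rates side by side.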
6. Many guests have special requirements when booking a hotel room. Do these requirements affect booking cancellation?
# Analyze the relationship between special requests and booking cancellation
print("Distribution of 'no_of_special_requests' by 'booking_status'")
display(df_eda.groupby("booking_status")["no_of_special_requests"].describe())
histogram_boxplot(df_eda, "no_of_special_requests", kde=True)
plt.figure(figsize=(10, 6))
sns.boxplot(
x="booking_status",
y="no_of_special_requests",
data=df_eda,
palette="pastel",
showmeans=True,
)
plt.title("Special Requests vs Booking Cancellation")
plt.xlabel("Booking Status")
plt.ylabel("Number of Special Requests")
plt.show()
# Additionally, calculate the average number of special requests for each booking status
special_requests_cancellation = df_eda.groupby("booking_status")[
"no_of_special_requests"
].mean()
print("Average Number of Special Requests by Booking Status:")
print(special_requests_cancellation)
print("-" * 100)
Distribution of 'no_of_special_requests' by 'booking_status'
| count | mean | std | min | 25% | 50% | 75% | max | |
|---|---|---|---|---|---|---|---|---|
| booking_status | ||||||||
| Canceled | 11885.00000 | 0.33462 | 0.57435 | 0.00000 | 0.00000 | 0.00000 | 1.00000 | 2.00000 |
| Not_Canceled | 24390.00000 | 0.75855 | 0.83653 | 0.00000 | 0.00000 | 1.00000 | 1.00000 | 5.00000 |
Average Number of Special Requests by Booking Status:
booking_status
Canceled       0.33462
Not_Canceled   0.75855
Name: no_of_special_requests, dtype: float64
----------------------------------------------------------------------------------------------------
Observations of Special Requests by Booking Status:
- The number of special requests does appear to affect cancellation: canceled bookings average 0.33 special requests versus 0.76 for bookings that were not canceled, suggesting that guests who make special requests are less likely to cancel.
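The relationship can also be summarized as a cancellation rate at each special-request count; a minimal sketch on toy data (illustrative values only):

```python
import pandas as pd

# Toy bookings (illustrative values only)
df_toy = pd.DataFrame({
    "no_of_special_requests": [0, 0, 0, 1, 1, 2],
    "booking_status": ["Canceled", "Canceled", "Not_Canceled",
                       "Not_Canceled", "Canceled", "Not_Canceled"],
})

# Share of canceled bookings at each special-request count
cancel_rate = (
    df_toy.assign(canceled=df_toy["booking_status"].eq("Canceled"))
    .groupby("no_of_special_requests")["canceled"]
    .mean()
)
```

On df_eda, a declining cancel_rate as the request count grows would make the observation above concrete.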
4. Data Preprocessing
- Missing value treatment (if needed)
- Feature engineering (if needed)
- Outlier detection and treatment (if needed)
- Preparing data for modeling
- Any other preprocessing steps (if needed)
# Let's create a copy of the data
df_preprocessing = df_eda.copy()
4.1 Missing Value & Duplicates Check and Treatment
Checking for Missing values
# Checking missing values across each column
missing_values = df_preprocessing.isnull().sum()
print("The number of missing values on each column of the data set is:")
missing_values
The number of missing values on each column of the data set is:
no_of_adults                            0
no_of_children                          0
no_of_weekend_nights                    0
no_of_week_nights                       0
type_of_meal_plan                       0
required_car_parking_space              0
room_type_reserved                      0
lead_time                               0
arrival_year                            0
arrival_month                           0
arrival_date                            0
market_segment_type                     0
repeated_guest                          0
no_of_previous_cancellations            0
no_of_previous_bookings_not_canceled    0
avg_price_per_room                      0
no_of_special_requests                  0
booking_status                          0
dtype: int64
- We can confirm that there are no missing values in the data.
Checking for Duplicates
# Check for complete duplicate records
duplicate_records = df_preprocessing.duplicated().sum()
print("The number of duplicate values on the data set is:", duplicate_records)
The number of duplicate values on the data set is: 10275
# Identify all duplicate rows, including the first occurrence
all_duplicate_rows = df_preprocessing[df_preprocessing.duplicated(keep=False)]
# Display all duplicate rows
display(all_duplicate_rows)
| no_of_adults | no_of_children | no_of_weekend_nights | no_of_week_nights | type_of_meal_plan | required_car_parking_space | room_type_reserved | lead_time | arrival_year | arrival_month | arrival_date | market_segment_type | repeated_guest | no_of_previous_cancellations | no_of_previous_bookings_not_canceled | avg_price_per_room | no_of_special_requests | booking_status | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2 | 0 | 1 | 2 | Meal Plan 1 | 0 | Room_Type 1 | 224 | 2017 | 10 | 2 | Offline | 0 | 0 | 0 | 65.00000 | 0 | Not_Canceled |
| 2 | 1 | 0 | 2 | 1 | Meal Plan 1 | 0 | Room_Type 1 | 1 | 2018 | 2 | 28 | Online | 0 | 0 | 0 | 60.00000 | 0 | Canceled |
| 3 | 2 | 0 | 0 | 2 | Meal Plan 1 | 0 | Room_Type 1 | 211 | 2018 | 5 | 20 | Online | 0 | 0 | 0 | 100.00000 | 0 | Canceled |
| 5 | 2 | 0 | 0 | 2 | Meal Plan 2 | 0 | Room_Type 1 | 346 | 2018 | 9 | 13 | Online | 0 | 0 | 0 | 115.00000 | 1 | Canceled |
| 7 | 2 | 0 | 1 | 3 | Meal Plan 1 | 0 | Room_Type 4 | 83 | 2018 | 12 | 26 | Online | 0 | 0 | 0 | 105.61000 | 1 | Not_Canceled |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 36261 | 1 | 0 | 2 | 4 | Meal Plan 1 | 0 | Room_Type 1 | 245 | 2018 | 7 | 6 | Offline | 0 | 0 | 0 | 110.00000 | 0 | Canceled |
| 36263 | 1 | 0 | 2 | 1 | Meal Plan 1 | 0 | Room_Type 1 | 116 | 2018 | 2 | 28 | Online | 0 | 0 | 0 | 1.00000 | 0 | Not_Canceled |
| 36267 | 2 | 0 | 1 | 0 | Not Selected | 0 | Room_Type 1 | 49 | 2018 | 7 | 11 | Online | 0 | 0 | 0 | 93.15000 | 0 | Canceled |
| 36268 | 1 | 0 | 0 | 3 | Meal Plan 1 | 0 | Room_Type 1 | 166 | 2018 | 11 | 1 | Offline | 0 | 0 | 0 | 110.00000 | 0 | Canceled |
| 36274 | 2 | 0 | 1 | 2 | Meal Plan 1 | 0 | Room_Type 1 | 207 | 2018 | 12 | 30 | Offline | 0 | 0 | 0 | 161.67000 | 0 | Not_Canceled |
13413 rows × 18 columns
# Check for duplicates in individual columns
for column in df_preprocessing:
print(f"Duplicates in {column}: {df_preprocessing[column].duplicated().sum()}")
print("-" * 50)
Duplicates in no_of_adults: 36270
--------------------------------------------------
Duplicates in no_of_children: 36269
--------------------------------------------------
Duplicates in no_of_weekend_nights: 36267
--------------------------------------------------
Duplicates in no_of_week_nights: 36257
--------------------------------------------------
Duplicates in type_of_meal_plan: 36271
--------------------------------------------------
Duplicates in required_car_parking_space: 36273
--------------------------------------------------
Duplicates in room_type_reserved: 36268
--------------------------------------------------
Duplicates in lead_time: 35923
--------------------------------------------------
Duplicates in arrival_year: 36273
--------------------------------------------------
Duplicates in arrival_month: 36263
--------------------------------------------------
Duplicates in arrival_date: 36244
--------------------------------------------------
Duplicates in market_segment_type: 36270
--------------------------------------------------
Duplicates in repeated_guest: 36273
--------------------------------------------------
Duplicates in no_of_previous_cancellations: 36266
--------------------------------------------------
Duplicates in no_of_previous_bookings_not_canceled: 36216
--------------------------------------------------
Duplicates in avg_price_per_room: 32345
--------------------------------------------------
Duplicates in no_of_special_requests: 36269
--------------------------------------------------
Duplicates in booking_status: 36273
--------------------------------------------------
- There are no duplicate rows in the data frame, as confirmed by `duplicate_records`.
- The "duplicates" within individual columns are irrelevant: each column has only a limited number of distinct values, so it is expected that many rows share the same value in any given column.
4.2 Feature Engineering¶
There are a few features we can engineer to gain more insight into the data. Here we will implement the following:

- We will create a new feature called `total_nights`, representing the total number of nights stayed (weekend + week nights).
- We will create a new binary feature, `any_special_requests`, indicating whether a customer made any special requests.
- We will transform `booking_status` into a binary feature.
# Check the unique values in the booking_status column
print(data["booking_status"].unique())
['Not_Canceled' 'Canceled']
# Creating new features for total number of nights stayed (weekend + weekday nights)
df_preprocessing["total_nights"] = (
df_preprocessing["no_of_weekend_nights"] + df_preprocessing["no_of_week_nights"]
)
# Creating a binary feature for whether a customer made any special requests
df_preprocessing["any_special_requests"] = df_preprocessing[
"no_of_special_requests"
].apply(lambda x: 1 if x > 0 else 0)
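As a design note, the same flag can be derived without `.apply` using a vectorized comparison, which is faster on large frames. A minimal sketch on a hypothetical toy frame standing in for `df_preprocessing`:

```python
import pandas as pd

# Hypothetical toy frame standing in for df_preprocessing
toy = pd.DataFrame({"no_of_special_requests": [0, 2, 1, 0]})

# Vectorized equivalent of the .apply(lambda ...) above: the comparison
# yields booleans, and astype(int) maps True/False to 1/0
toy["any_special_requests"] = (toy["no_of_special_requests"] > 0).astype(int)
```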
# Display the data types of the columns in the dataset
display(df_preprocessing.info())
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 36275 entries, 0 to 36274
Data columns (total 20 columns):
 #   Column                                Non-Null Count  Dtype
---  ------                                --------------  -----
 0   no_of_adults                          36275 non-null  int64
 1   no_of_children                        36275 non-null  int64
 2   no_of_weekend_nights                  36275 non-null  int64
 3   no_of_week_nights                     36275 non-null  int64
 4   type_of_meal_plan                     36275 non-null  object
 5   required_car_parking_space            36275 non-null  int64
 6   room_type_reserved                    36275 non-null  object
 7   lead_time                             36275 non-null  int64
 8   arrival_year                          36275 non-null  int64
 9   arrival_month                         36275 non-null  int64
 10  arrival_date                          36275 non-null  int64
 11  market_segment_type                   36275 non-null  object
 12  repeated_guest                        36275 non-null  int64
 13  no_of_previous_cancellations          36275 non-null  int64
 14  no_of_previous_bookings_not_canceled  36275 non-null  int64
 15  avg_price_per_room                    36275 non-null  float64
 16  no_of_special_requests                36275 non-null  int64
 17  booking_status                        36275 non-null  object
 18  total_nights                          36275 non-null  int64
 19  any_special_requests                  36275 non-null  int64
dtypes: float64(1), int64(15), object(4)
memory usage: 5.5+ MB
None
# Display the first few rows of the dataset
display(df_preprocessing.head())
| no_of_adults | no_of_children | no_of_weekend_nights | no_of_week_nights | type_of_meal_plan | required_car_parking_space | room_type_reserved | lead_time | arrival_year | arrival_month | arrival_date | market_segment_type | repeated_guest | no_of_previous_cancellations | no_of_previous_bookings_not_canceled | avg_price_per_room | no_of_special_requests | booking_status | total_nights | any_special_requests | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2 | 0 | 1 | 2 | Meal Plan 1 | 0 | Room_Type 1 | 224 | 2017 | 10 | 2 | Offline | 0 | 0 | 0 | 65.00000 | 0 | Not_Canceled | 3 | 0 |
| 1 | 2 | 0 | 2 | 3 | Not Selected | 0 | Room_Type 1 | 5 | 2018 | 11 | 6 | Online | 0 | 0 | 0 | 106.68000 | 1 | Not_Canceled | 5 | 1 |
| 2 | 1 | 0 | 2 | 1 | Meal Plan 1 | 0 | Room_Type 1 | 1 | 2018 | 2 | 28 | Online | 0 | 0 | 0 | 60.00000 | 0 | Canceled | 3 | 0 |
| 3 | 2 | 0 | 0 | 2 | Meal Plan 1 | 0 | Room_Type 1 | 211 | 2018 | 5 | 20 | Online | 0 | 0 | 0 | 100.00000 | 0 | Canceled | 2 | 0 |
| 4 | 2 | 0 | 1 | 1 | Not Selected | 0 | Room_Type 1 | 48 | 2018 | 4 | 11 | Online | 0 | 0 | 0 | 94.50000 | 0 | Canceled | 2 | 0 |
4.3 Outlier Check and Treatment¶
# Making a list of all numerical variables ('int64', 'float64', 'complex')
num_cols = df_preprocessing.select_dtypes(
include=["int64", "float64", "complex"]
).columns
# Calculate the number of rows needed for the subplots
num_plots = len(num_cols)
num_rows = (num_plots // 3) + (num_plots % 3 > 0)
# Create subplots
plt.figure(figsize=(15, num_rows * 5))
for i, variable in enumerate(num_cols):
plt.subplot(num_rows, 3, i + 1)
sns.boxplot(data=df_preprocessing, x=variable)
plt.tight_layout(pad=2)
plt.show()
# Check for outliers in continuous variables
outliers = df_preprocessing[num_cols].describe()
print("\nSummary of the numerical features, including outliers:")
display(outliers)
Summary of the numerical features, including outliers:
| no_of_adults | no_of_children | no_of_weekend_nights | no_of_week_nights | required_car_parking_space | lead_time | arrival_year | arrival_month | arrival_date | repeated_guest | no_of_previous_cancellations | no_of_previous_bookings_not_canceled | avg_price_per_room | no_of_special_requests | total_nights | any_special_requests | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 36275.00000 | 36275.00000 | 36275.00000 | 36275.00000 | 36275.00000 | 36275.00000 | 36275.00000 | 36275.00000 | 36275.00000 | 36275.00000 | 36275.00000 | 36275.00000 | 36275.00000 | 36275.00000 | 36275.00000 | 36275.00000 |
| mean | 1.84496 | 0.10528 | 0.81072 | 2.20430 | 0.03099 | 85.23256 | 2017.82043 | 7.42365 | 15.59700 | 0.02564 | 0.02335 | 0.15341 | 103.42354 | 0.61966 | 3.01502 | 0.45480 |
| std | 0.51871 | 0.40265 | 0.87064 | 1.41090 | 0.17328 | 85.93082 | 0.38384 | 3.06989 | 8.74045 | 0.15805 | 0.36833 | 1.75417 | 35.08942 | 0.78624 | 1.78602 | 0.49796 |
| min | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 2017.00000 | 1.00000 | 1.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 |
| 25% | 2.00000 | 0.00000 | 0.00000 | 1.00000 | 0.00000 | 17.00000 | 2018.00000 | 5.00000 | 8.00000 | 0.00000 | 0.00000 | 0.00000 | 80.30000 | 0.00000 | 2.00000 | 0.00000 |
| 50% | 2.00000 | 0.00000 | 1.00000 | 2.00000 | 0.00000 | 57.00000 | 2018.00000 | 8.00000 | 16.00000 | 0.00000 | 0.00000 | 0.00000 | 99.45000 | 0.00000 | 3.00000 | 0.00000 |
| 75% | 2.00000 | 0.00000 | 2.00000 | 3.00000 | 0.00000 | 126.00000 | 2018.00000 | 10.00000 | 23.00000 | 0.00000 | 0.00000 | 0.00000 | 120.00000 | 1.00000 | 4.00000 | 1.00000 |
| max | 4.00000 | 10.00000 | 7.00000 | 17.00000 | 1.00000 | 443.00000 | 2018.00000 | 12.00000 | 31.00000 | 1.00000 | 13.00000 | 58.00000 | 540.00000 | 5.00000 | 24.00000 | 1.00000 |
Observations:
Outliers were identified in several continuous variables, most notably:

- `lead_time`
- `no_of_previous_cancellations`
- `no_of_previous_bookings_not_canceled`
- `avg_price_per_room`
- `total_nights`

Outlier Investigation: A few features warrant further investigation concerning outliers.
- One notable example is the single high price outlier of over $500 in the `avg_price_per_room` feature.
- Another area requiring scrutiny is the set of rows where `no_of_children` > 0 and `no_of_adults` = 0, as this seems inconsistent and could indicate data entry errors or unusual cases that need closer examination.
Impact of Outlier Treatment: In previous iterations of this model, I observed that treating outliers resulted in two or more variables becoming perfectly collinear. This caused the logistic regression model to fail when computing coefficients for these variables.
Effect of Multicollinearity Treatment: Additionally, after addressing multicollinearity, as well as removing constant and duplicate columns, the model's performance declined compared to when multicollinearity was left untreated.
In addition, considering that our model needs to account for real-world scenarios, which can be highly unique, we will not be addressing outliers or multicollinearity prior to building the initial model for this project. This approach allows the model to capture the full variability of the data, reflecting the diverse situations it may encounter in practice.
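For reference, had we chosen to treat the outliers, a common approach is to cap values at Tukey's IQR fences. A minimal sketch, not applied to the project data, shown on a hypothetical toy price series:

```python
import pandas as pd

def cap_outliers_iqr(s: pd.Series, k: float = 1.5) -> pd.Series:
    """Clip values to Tukey's fences [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, q3 = s.quantile(0.25), s.quantile(0.75)
    iqr = q3 - q1
    return s.clip(lower=q1 - k * iqr, upper=q3 + k * iqr)

# Toy series with one extreme room price
prices = pd.Series([80.0, 99.0, 120.0, 540.0])
capped = cap_outliers_iqr(prices)
```

Capping (rather than dropping) keeps the row count intact, but, as noted above, it can push distinct columns toward perfect collinearity.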
4.4 Preparing Data for Modeling¶
# Let's create another copy of the data
dfmodeling = df_preprocessing.copy()
from sklearn.preprocessing import LabelEncoder
import pandas as pd
import numpy as np
def identify_and_prepare_data(
df, target_column, exclude_columns=None, multicollinearity_threshold=0.95
):
"""
Automatically identify categorical and derived columns, then prepare the data for modeling.
Parameters:
df (pd.DataFrame): The input DataFrame.
target_column (str): The name of the target variable column.
exclude_columns (list): List of columns to exclude from processing (e.g., ID columns).
multicollinearity_threshold (float): Threshold for identifying derived columns with high multicollinearity.
Returns:
X (pd.DataFrame): The features DataFrame after dropping unnecessary columns.
y (pd.Series): The target variable series.
categorical_columns (list): List of identified categorical columns.
derived_columns (list): List of identified derived columns.
"""
if exclude_columns is None:
exclude_columns = []
# Identify categorical columns
categorical_columns = df.select_dtypes(
include=["object", "category"]
).columns.tolist()
# Remove the target column and excluded columns from categorical columns list
categorical_columns = [
col
for col in categorical_columns
if col not in [target_column] + exclude_columns
]
# Encode categorical columns
for col in categorical_columns:
df[f"{col}_encoded"] = LabelEncoder().fit_transform(df[col])
# Drop original categorical columns after encoding
df = df.drop(columns=categorical_columns)
# Identify derived columns based on multicollinearity
# Ensure we only consider numeric columns for the correlation matrix
numeric_df = df.select_dtypes(include=[np.number])
correlation_matrix = numeric_df.corr().abs()
upper_triangle = correlation_matrix.where(
np.triu(np.ones(correlation_matrix.shape), k=1).astype(bool)
)
# Identify columns with high correlation to drop
derived_columns = [
"total_nights"
# column
# for column in upper_triangle.columns
# if any(upper_triangle[column] > multicollinearity_threshold)
]
# Create the list of columns to drop
columns_to_drop = [target_column] + derived_columns + exclude_columns
# Drop unnecessary columns
X = df.drop(columns=columns_to_drop)
y = df[target_column].apply(lambda x: 1 if x == "Canceled" else 0)
return X, y, categorical_columns, derived_columns
# Usage Example
# Assuming df_preprocessing is your DataFrame and "booking_status" is the target variable
X, y, identified_categorical_columns, identified_derived_columns = (
identify_and_prepare_data(
df=df_preprocessing,
target_column="booking_status",
# exclude_columns=["booking_id"], # Example of an ID column to exclude
)
)
# Splitting the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.2, random_state=42
)
# Outputs: Checking the updated dataset and identified columns
print("First few rows of the training set:")
display(X_train.head())
print("\nFirst few entries of the target variable in the training set:")
display(y_train.head())
print("\nIdentified Categorical Columns:")
print(identified_categorical_columns)
print("\nIdentified Derived Columns:")
print(identified_derived_columns)
First few rows of the training set:
| no_of_adults | no_of_children | no_of_weekend_nights | no_of_week_nights | required_car_parking_space | lead_time | arrival_year | arrival_month | arrival_date | repeated_guest | no_of_previous_cancellations | no_of_previous_bookings_not_canceled | avg_price_per_room | no_of_special_requests | any_special_requests | type_of_meal_plan_encoded | room_type_reserved_encoded | market_segment_type_encoded | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 25629 | 2 | 1 | 2 | 1 | 0 | 26 | 2017 | 10 | 17 | 0 | 0 | 0 | 161.00000 | 0 | 0 | 0 | 0 | 4 |
| 14473 | 2 | 1 | 1 | 1 | 0 | 98 | 2018 | 7 | 16 | 0 | 0 | 0 | 121.50000 | 2 | 1 | 0 | 0 | 4 |
| 23720 | 2 | 0 | 0 | 3 | 0 | 433 | 2018 | 9 | 8 | 0 | 0 | 0 | 70.00000 | 0 | 0 | 0 | 0 | 3 |
| 5843 | 2 | 0 | 2 | 5 | 0 | 195 | 2018 | 8 | 8 | 0 | 0 | 0 | 72.25000 | 0 | 0 | 0 | 0 | 3 |
| 18709 | 1 | 0 | 0 | 2 | 0 | 188 | 2018 | 6 | 15 | 0 | 0 | 0 | 130.00000 | 0 | 0 | 0 | 0 | 3 |
First few entries of the target variable in the training set:
25629    0
14473    0
23720    1
5843     0
18709    1
Name: booking_status, dtype: int64
Identified Categorical Columns:
['type_of_meal_plan', 'room_type_reserved', 'market_segment_type']

Identified Derived Columns:
['total_nights']
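One caveat worth flagging: `LabelEncoder` assigns arbitrary integer codes to categories, which imposes an ordering the data does not have. An alternative is one-hot encoding via `pd.get_dummies`, sketched below on a hypothetical toy frame; `drop_first=True` removes one level to avoid perfect collinearity among the dummies (the dummy-variable trap):

```python
import pandas as pd

# Hypothetical toy frame; the real column holds the same kind of values
toy = pd.DataFrame(
    {"type_of_meal_plan": ["Meal Plan 1", "Not Selected", "Meal Plan 2"]}
)

# One dummy column per category level, minus the first level
dummies = pd.get_dummies(
    toy, columns=["type_of_meal_plan"], drop_first=True, dtype=int
)
```

We keep label encoding here for compactness, but one-hot encoding is generally safer for nominal categories in a linear model.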
# # Data Transformation: Encoding categorical variables
# dfmodeling["type_of_meal_plan_encoded"] = (
# dfmodeling["type_of_meal_plan"].astype("category").cat.codes
# )
# dfmodeling["room_type_reserved_encoded"] = (
# dfmodeling["room_type_reserved"].astype("category").cat.codes
# )
# dfmodeling["market_segment_type_encoded"] = (
# dfmodeling["market_segment_type"].astype("category").cat.codes
# )
# # Separating features and target variable
# X = dfmodeling.drop(
# columns=[
# "booking_status",
# "type_of_meal_plan",
# "room_type_reserved",
# "market_segment_type",
# "total_nights",
# ]
# )
# y = dfmodeling["booking_status"].apply(
# lambda x: 1 if x == "Canceled" else 0
# ) # Encoding target variable
# # Splitting the dfmodeling into training and testing sets
# X_train, X_test, y_train, y_test = train_test_split(
# X, y, test_size=0.2, random_state=42
# )
# # Outputs: Checking the updated dfmodelingset, outliers, and splits
# print("First few rows of the training dfmodeling:")
# display(X_train.head())
# print("\nFirst few entries of the target variable in the training set:")
# display(y_train.head())
# Display the shapes of the training and testing sets
print("Shape of Training set : ", X_train.shape)
print("Shape of test set : ", X_test.shape)
print("\nPercentage of classes in training set:")
display(y_train.value_counts(normalize=True))
print("\nPercentage of classes in test set:")
display(y_test.value_counts(normalize=True))
Shape of Training set :  (29020, 18)
Shape of test set :  (7255, 18)

Percentage of classes in training set:
booking_status
0    0.67371
1    0.32629
Name: proportion, dtype: float64
Percentage of classes in test set:
booking_status
0    0.66699
1    0.33301
Name: proportion, dtype: float64
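The class proportions above differ slightly between the training and test sets (about 67.4% vs 66.7% for class 0) because the split was not stratified. Passing `stratify=y` to `train_test_split` keeps the ratio identical (up to rounding) in both sets; a minimal sketch on synthetic data:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic 70/30 class mix standing in for the real X and y
X_demo = np.arange(100).reshape(-1, 1)
y_demo = np.array([0] * 70 + [1] * 30)

# stratify preserves the class ratio in both splits
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)
```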
4.5 Checking Multicollinearity (Optional if Outlier treatments were performed)¶
- As mentioned before, in previous iterations of this model, I observed that treating outliers resulted in two or more variables becoming perfectly collinear. This caused the logistic regression model to fail when computing coefficients for these variables.
- Effect of Multicollinearity Treatment: Additionally, after addressing multicollinearity, as well as removing constant and duplicate columns, the model's performance declined compared to when multicollinearity was left untreated.
In addition, considering that our model needs to account for real-world scenarios, which can be highly unique, we will not be addressing outliers or multicollinearity prior to building the initial model for this project. This approach allows the model to capture the full variability of the data, reflecting the diverse situations it may encounter in practice.
5. Logistic Regression Model Building¶
5.1 Model Evaluation Criteria¶
In building our classification model, it is crucial to consider the types of errors the model might make:
False Negative: Predicting that a booking will not be canceled when, in reality, it is canceled. This scenario could lead to resource wastage, additional distribution costs, and lower profit margins due to last-minute attempts to resell the room.
False Positive: Predicting that a booking will be canceled when, in reality, it is not. This could result in the hotel not adequately preparing for a guest's arrival, potentially leading to dissatisfaction and damage to the hotel's brand equity.
Which Case is More Important?¶
Both types of errors are significant:
- False Negatives can result in direct financial losses and operational inefficiencies.
- False Positives can damage customer satisfaction and brand reputation.
How to Reduce Losses?¶
- To minimize these risks, we aim to maximize the `F1 Score` of our model. The F1 Score is a balanced metric that considers both `precision` and `recall`, making it particularly useful in scenarios where both false positives and false negatives are costly. By maximizing the F1 Score, we increase our chances of minimizing both types of errors, thereby reducing the potential negative impact on the hotel's operations and profitability.
- The F1 score is computed as $$F1 = \frac{2 \cdot Precision \cdot Recall}{Precision + Recall}$$
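To make the metric concrete, here is a small sketch computing precision, recall, and F1 on a toy prediction vector and verifying the harmonic-mean relationship:

```python
from sklearn.metrics import f1_score, precision_score, recall_score

# Toy predictions with one false positive and one false negative
y_true = [1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 1, 0]

p = precision_score(y_true, y_pred)  # TP / (TP + FP)
r = recall_score(y_true, y_pred)     # TP / (TP + FN)
f1 = f1_score(y_true, y_pred)        # harmonic mean of precision and recall
```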
To streamline our workflow and avoid repeating code for each model, we will create two essential functions that will handle the calculation of different metrics and the generation of a confusion matrix.¶
- `model_performance_classification_statsmodels`: This function evaluates the performance of classification models. By using it, we can consistently assess the effectiveness of our models across various metrics, ensuring a thorough analysis without redundant code.
- `confusion_matrix_statsmodels`: This function plots the confusion matrix, a crucial tool for understanding the types of errors our models make. By visualizing the confusion matrix, we gain insight into how well our models distinguish between classes, helping us identify areas for improvement.
# defining a function to compute different metrics to check performance of a classification model built using statsmodels
def model_performance_classification_statsmodels(
model, predictors, target, threshold=0.5
):
"""
Function to compute different metrics to check classification model performance
model: classifier
predictors: independent variables
target: dependent variable
threshold: threshold for classifying the observation as class 1
"""
# checking which probabilities are greater than threshold
pred_temp = model.predict(predictors) > threshold
# rounding off the above values to get classes
pred = np.round(pred_temp)
acc = accuracy_score(target, pred) # to compute Accuracy
recall = recall_score(target, pred) # to compute Recall
precision = precision_score(target, pred) # to compute Precision
f1 = f1_score(target, pred) # to compute F1-score
# creating a dataframe of metrics
df_perf = pd.DataFrame(
{
"Accuracy": acc,
"Recall": recall,
"Precision": precision,
"F1": f1,
},
index=[0],
)
return df_perf
# defining a function to plot the confusion_matrix of a classification model
def confusion_matrix_statsmodels(model, predictors, target, threshold=0.5):
"""
To plot the confusion_matrix with percentages and print TN, FP, FN, TP
model: classifier
predictors: independent variables
target: dependent variable
threshold: threshold for classifying the observation as class 1
"""
y_pred = model.predict(predictors) > threshold
cm = confusion_matrix(target, y_pred)
TN, FP, FN, TP = cm.ravel()
# Display counts for TN, FP, FN, TP
print(f"True Negatives (TN): {TN}")
print(f"False Positives (FP): {FP}")
print(f"False Negatives (FN): {FN}")
print(f"True Positives (TP): {TP}")
labels = np.asarray(
[
[f"TN: {TN}\n{TN / cm.sum():.2%}", f"FP: {FP}\n{FP / cm.sum():.2%}"],
[f"FN: {FN}\n{FN / cm.sum():.2%}", f"TP: {TP}\n{TP / cm.sum():.2%}"],
]
).reshape(2, 2)
plt.figure(figsize=(6, 4))
sns.heatmap(cm, annot=labels, fmt="")
plt.ylabel("True label")
plt.xlabel("Predicted label")
plt.show()
5.2 Logistic Regression Model¶
We will now perform logistic regression using statsmodels, a Python module that offers comprehensive functions for estimating a wide range of statistical models, conducting statistical tests, and exploring statistical data.
By utilizing statsmodels for logistic regression, we can rigorously assess the statistical validity of our model. This approach allows us to identify the significant predictors by examining the p-values associated with each predictor variable, helping us understand which factors have a meaningful impact on the outcome.
This method will not only help us build a robust logistic regression model but also provide insights into the significance of each predictor, ensuring that our model is both accurate and interpretable.
# Building the logistic regression model
logit = sm.Logit(y_train, X_train.astype(float))
lg = logit.fit(disp=False)
print(lg.summary())
Logit Regression Results
==============================================================================
Dep. Variable: booking_status No. Observations: 29020
Model: Logit Df Residuals: 29002
Method: MLE Df Model: 17
Date: Sat, 03 Aug 2024 Pseudo R-squ.: 0.3097
Time: 01:50:41 Log-Likelihood: -12651.
converged: True LL-Null: -18327.
Covariance Type: nonrobust LLR p-value: 0.000
========================================================================================================
coef std err z P>|z| [0.025 0.975]
--------------------------------------------------------------------------------------------------------
no_of_adults 0.0401 0.034 1.169 0.242 -0.027 0.107
no_of_children -0.0503 0.043 -1.164 0.244 -0.135 0.034
no_of_weekend_nights 0.1338 0.018 7.318 0.000 0.098 0.170
no_of_week_nights 0.0535 0.011 4.700 0.000 0.031 0.076
required_car_parking_space -1.5486 0.130 -11.926 0.000 -1.803 -1.294
lead_time 0.0145 0.000 65.432 0.000 0.014 0.015
arrival_year -0.0039 8.16e-05 -48.006 0.000 -0.004 -0.004
arrival_month -0.0644 0.006 -11.557 0.000 -0.075 -0.053
arrival_date 0.0023 0.002 1.289 0.197 -0.001 0.006
repeated_guest -1.5182 0.510 -2.974 0.003 -2.519 -0.518
no_of_previous_cancellations 0.1487 0.072 2.079 0.038 0.009 0.289
no_of_previous_bookings_not_canceled -0.0259 0.071 -0.363 0.717 -0.166 0.114
avg_price_per_room 0.0196 0.001 31.034 0.000 0.018 0.021
no_of_special_requests -1.0567 0.055 -19.318 0.000 -1.164 -0.949
any_special_requests -0.4749 0.075 -6.358 0.000 -0.621 -0.329
type_of_meal_plan_encoded 0.1087 0.015 7.022 0.000 0.078 0.139
room_type_reserved_encoded -0.0850 0.014 -5.908 0.000 -0.113 -0.057
market_segment_type_encoded 1.2783 0.039 32.879 0.000 1.202 1.354
========================================================================================================
Observations
The Pseudo R-squared of approximately 0.31 suggests a decent level of explanatory power for a logistic regression model. (Unlike R-squared in linear regression, it is not literally the proportion of variance explained, but higher values still indicate a better fit relative to the null model.)
Variables with p-values less than 0.05 are considered statistically significant. These include:
- `required_car_parking_space`
- `lead_time`
- `arrival_year`
- `arrival_month`
- `repeated_guest`
- `avg_price_per_room`
- `no_of_special_requests`
- `market_segment_type_encoded`
Variables with p-values greater than 0.05 are considered statistically insignificant. These include:
- `no_of_adults`
- `no_of_children`
- `arrival_date`
- `no_of_previous_bookings_not_canceled`
In earlier iterations that kept `total_nights` alongside `no_of_weekend_nights` and `no_of_week_nights`, the model produced `NaN` values for these coefficients. This is expected, since `total_nights` is an exact linear combination of the other two (perfect multicollinearity); dropping `total_nights` before fitting, as done above, resolves the issue.
A negative coefficient indicates that the probability of a booking being canceled decreases as the corresponding attribute value increases.
A positive coefficient indicates that the probability of a booking being canceled increases as the corresponding attribute value increases.
# Plot the confusion matrix for the training data
confusion_matrix_statsmodels(lg, X_train, y_train)
# Print the performance metrics for the training data
print("Training performance:")
logit_performance_metrics_train = model_performance_classification_statsmodels(
lg, X_train, y_train
)
logit_performance_metrics_train
True Negatives (TN): 17543
False Positives (FP): 2008
False Negatives (FN): 3669
True Positives (TP): 5800
Training performance:
| Accuracy | Recall | Precision | F1 | |
|---|---|---|---|---|
| 0 | 0.80438 | 0.61253 | 0.74283 | 0.67141 |
# Plot the confusion matrix for the testing data
confusion_matrix_statsmodels(lg, X_test, y_test)
# Print the performance metrics for the training data
print("Testing performance:")
logit_performance_metrics_test = model_performance_classification_statsmodels(
lg, X_test, y_test
)
logit_performance_metrics_test
True Negatives (TN): 4352
False Positives (FP): 487
False Negatives (FN): 936
True Positives (TP): 1480
Testing performance:
| Accuracy | Recall | Precision | F1 | |
|---|---|---|---|---|
| 0 | 0.80386 | 0.61258 | 0.75241 | 0.67534 |
Observations:
Training data:
- True Negatives (TN): The model correctly predicted 17,543 bookings as "Not Canceled," which accounts for about 60.45% of the total predictions. This shows that the model is quite effective at identifying bookings that are not going to be canceled.
- False Positives (FP): There are 2,008 instances where the model incorrectly predicted a booking as "Canceled" when it was actually "Not Canceled." This represents about 6.92% of the total predictions. While this number isn't excessively high, it indicates some room for improvement in reducing false alarms, which could lead to unnecessary operational changes or customer dissatisfaction.
- False Negatives (FN): The model incorrectly predicted 3,669 bookings as "Not Canceled" when they were actually "Canceled." This is about 12.64% of the total predictions. This type of error is more concerning, as it could lead to loss of revenue and operational inefficiencies due to unanticipated cancellations.
- True Positives (TP): The model correctly predicted 5,800 bookings as "Canceled," accounting for about 19.99% of the total predictions. This is a significant portion, indicating that the model is fairly good at identifying cancellations.

Performance Metrics Observations:
- Accuracy (80.44%): The overall accuracy of the model is 80.44%, meaning that the model correctly classified about 80% of the cases. This is a strong performance, suggesting that the model is generally reliable.
- Recall (61.25%): The recall is 61.25%, which means that the model correctly identified 61.25% of all actual cancellations. This metric shows that while the model is decent at catching cancellations, there is still room to improve, particularly in reducing false negatives.
- Precision (74.28%): The precision is 74.28%, indicating that when the model predicts a cancellation, it is correct about 74.28% of the time. This is a good precision score, showing that the model is fairly accurate when it predicts a cancellation.
- F1 Score (67.14%): The F1 score, which balances precision and recall, is 67.14%. This suggests that the model has a reasonably balanced performance but could benefit from improvements in both recall and precision to enhance its overall effectiveness.
Testing Data:
- The results on the training and testing data are quite comparable, which is a good indicator that the model generalizes well and is not overfitting.
Checking Multicollinearity¶
There are different ways of detecting (or testing for) multicollinearity. One such way is using the Variation Inflation Factor (VIF).
Variance Inflation Factor: Variance inflation factors measure the inflation in the variances of the regression coefficient estimates due to collinearities among the predictors. For the $k$th predictor, $VIF_k = \frac{1}{1 - R_k^2}$, where $R_k^2$ is the $R^2$ obtained by regressing that predictor on all the others; it quantifies how much the variance of the estimated regression coefficient $\beta_k$ is "inflated" by correlation among the predictor variables.
General Rule of thumb:
- If VIF is 1, there is no correlation between the $k$th predictor and the remaining predictor variables, and hence the variance of $\beta_k$ is not inflated at all
- If VIF exceeds 5, we say there is moderate multicollinearity
- If VIF is equal or exceeding 10, it shows signs of high multi-collinearity
The purpose of the analysis should dictate which threshold to use.
# Function to calculate VIF for each feature
def calculate_vif(X):
vif_data = pd.DataFrame()
vif_data["feature"] = X.columns
vif_data["VIF"] = [
variance_inflation_factor(X.values, i) for i in range(X.shape[1])
]
return vif_data
# Function to iteratively remove features with high VIF
def remove_high_vif_features(X, threshold=5.0):
while True:
vif = calculate_vif(X)
max_vif = vif["VIF"].max()
if max_vif > threshold:
feature_to_remove = vif.loc[vif["VIF"].idxmax(), "feature"]
print(
f"Removing feature with high VIF: {feature_to_remove} (VIF = {max_vif:.2f})"
)
X = X.drop(columns=[feature_to_remove])
else:
break
return X, vif
# Calculate VIF for the features
vif_before = calculate_vif(X_train)
print(vif_before)
                                 feature      VIF
0                           no_of_adults 17.65198
1                         no_of_children  1.34301
2                   no_of_weekend_nights  1.97664
3                      no_of_week_nights  3.76298
4             required_car_parking_space  1.06870
5                              lead_time  2.23127
6                           arrival_year 53.55953
7                          arrival_month  7.23461
8                           arrival_date  4.20731
9                         repeated_guest  1.69811
10          no_of_previous_cancellations  1.35802
11  no_of_previous_bookings_not_canceled  1.61317
12                    avg_price_per_room 15.57239
13                no_of_special_requests  6.53970
14                  any_special_requests  7.34724
15             type_of_meal_plan_encoded  1.41589
16            room_type_reserved_encoded  1.91665
17           market_segment_type_encoded 44.68421
# Iteratively remove features with high VIF
X_train_vif_reduced, final_vif = remove_high_vif_features(X_train)
print("Final VIF values after removing high multicollinearity features:")
print(final_vif)
Removing feature with high VIF: arrival_year (VIF = 53.56)
Removing feature with high VIF: market_segment_type_encoded (VIF = 26.94)
Removing feature with high VIF: no_of_adults (VIF = 13.12)
Removing feature with high VIF: avg_price_per_room (VIF = 8.32)
Removing feature with high VIF: any_special_requests (VIF = 7.09)
Final VIF values after removing high multicollinearity features:
feature VIF
0 no_of_children 1.23818
1 no_of_weekend_nights 1.88363
2 no_of_week_nights 3.29695
3 required_car_parking_space 1.05835
4 lead_time 2.12983
5 arrival_month 4.55520
6 arrival_date 3.19793
7 repeated_guest 1.53025
8 no_of_previous_cancellations 1.34571
9 no_of_previous_bookings_not_canceled 1.60304
10 no_of_special_requests 1.72007
11 type_of_meal_plan_encoded 1.27720
12 room_type_reserved_encoded 1.53312
- Now that multicollinearity has been addressed, we will rebuild the model
X_trained_multicoll = X_train_vif_reduced
logit2_multicoll = sm.Logit(y_train, X_trained_multicoll.astype(float))
lg2 = logit2_multicoll.fit(disp=False)
# Plot the confusion matrix for the training data
confusion_matrix_statsmodels(lg2, X_trained_multicoll, y_train)
# Print the performance metrics for the new training data
print("Training performance:")
multicoll_performance_metrics_train = model_performance_classification_statsmodels(
lg2, X_trained_multicoll, y_train
)
display(multicoll_performance_metrics_train)
# Align the testing data with the reduced training data (after removing high VIF features)
X_test_vif_reduced = X_test[X_train_vif_reduced.columns]
X_test_multicoll = X_test_vif_reduced
# Predict on the testing data
y_test_pred = lg2.predict(X_test_multicoll.astype(float))
# Plot the confusion matrix for the testing data
confusion_matrix_statsmodels(lg2, X_test_multicoll, y_test)
# Print the performance metrics for the testing data
print("Testing performance:")
multicoll_performance_metrics_train = model_performance_classification_statsmodels(
lg2, X_test_multicoll, y_test
)
display(multicoll_performance_metrics_train)
True Negatives (TN): 16938
False Positives (FP): 2613
False Negatives (FN): 4474
True Positives (TP): 4995
Training performance:
| Accuracy | Recall | Precision | F1 | |
|---|---|---|---|---|
| 0 | 0.75579 | 0.52751 | 0.65655 | 0.58500 |
True Negatives (TN): 4206
False Positives (FP): 633
False Negatives (FN): 1154
True Positives (TP): 1262
Testing performance:
| Accuracy | Recall | Precision | F1 | |
|---|---|---|---|---|
| 0 | 0.75369 | 0.52235 | 0.66596 | 0.58548 |
Observations:
- The second model actually performed worse than the first across all metrics.
- It is still recommended to choose the second model, since multicollinearity has been addressed in it, making the coefficient estimates more stable and interpretable.
Treating High P-values¶
For the remaining attributes in the data, the p-values are high for only a few variables. Since p-values can change after a variable is dropped, we will drop these variables iteratively rather than all at once.
Instead, we will do the following repeatedly using a loop:
- Build a model, check the p-values of the variables, and drop the column with the highest p-value.
- Create a new model without the dropped feature, check the p-values of the variables, and drop the column with the highest p-value.
- Repeat the above two steps till there are no columns with p-value > 0.05.
Note: The above process can also be done manually by picking one variable at a time that has a high p-value, dropping it, and building a model again. But that might be a little tedious and using a function will be more efficient.
def remove_high_p_value_variables(X_train, y_train, significance_level=0.05):
"""
Iteratively remove variables with p-values greater than the significance level from the logistic regression model.
Displays the initial and final p-values before and after the iteration.
Parameters:
X_train (pd.DataFrame): The features DataFrame for the training set.
y_train (pd.Series): The target variable for the training set.
significance_level (float): The p-value threshold for keeping variables (default is 0.05).
Returns:
final_selected_features (pd.DataFrame): The DataFrame of selected features after removing variables with high p-values.
final_p_values (pd.Series): The final p-values after the iteration.
"""
# Initial list of columns
cols = X_train.columns.tolist()
# Fit the initial logistic regression model
initial_model = sm.Logit(y_train, X_train.astype(float)).fit(disp=False)
# Display initial p-values
initial_p_values = initial_model.pvalues
print("Initial p-values:")
display(initial_p_values)
while len(cols) > 0:
# Define the training set with the current list of columns
X_train_aux = X_train[cols]
# Fit the logistic regression model
model = sm.Logit(y_train, X_train_aux.astype(float)).fit(disp=False)
# Get the p-values and find the maximum p-value
p_values = model.pvalues
max_p_value = max(p_values)
# Name of the variable with the highest p-value
feature_with_p_max = p_values.idxmax()
# Check if the highest p-value exceeds the significance level
if max_p_value > significance_level:
print(f"Dropping '{feature_with_p_max}' with p-value {max_p_value:.5f}")
cols.remove(
feature_with_p_max
) # Remove the feature with the highest p-value
else:
break # Stop if all p-values are below the significance level
# Selected features after the iterative process
final_selected_features = X_train[cols]
# Fit the final model with the selected features
final_model = sm.Logit(y_train, final_selected_features.astype(float)).fit(
disp=False
)
# Display final p-values
final_p_values = final_model.pvalues
print("\nFinal p-values after iteration:")
display(final_p_values)
return final_selected_features, final_p_values
# Running the function
X_train_pvalues, final_p_values = remove_high_p_value_variables(
X_trained_multicoll, y_train, significance_level=0.05
)
Initial p-values:
no_of_children 0.00000 no_of_weekend_nights 0.00552 no_of_week_nights 0.00000 required_car_parking_space 0.00000 lead_time 0.00000 arrival_month 0.00000 arrival_date 0.00000 repeated_guest 0.00000 no_of_previous_cancellations 0.00228 no_of_previous_bookings_not_canceled 0.25613 no_of_special_requests 0.00000 type_of_meal_plan_encoded 0.00000 room_type_reserved_encoded 0.00000 dtype: float64
Dropping 'no_of_previous_bookings_not_canceled' with p-value 0.25613
Final p-values after iteration:
no_of_children 0.00000 no_of_weekend_nights 0.00554 no_of_week_nights 0.00000 required_car_parking_space 0.00000 lead_time 0.00000 arrival_month 0.00000 arrival_date 0.00000 repeated_guest 0.00000 no_of_previous_cancellations 0.00227 no_of_special_requests 0.00000 type_of_meal_plan_encoded 0.00000 room_type_reserved_encoded 0.00000 dtype: float64
logit3_pvalues = sm.Logit(y_train, X_train_pvalues.astype(float))
lg3 = logit3_pvalues.fit(disp=False)
# Plot the confusion matrix for the training data
confusion_matrix_statsmodels(lg3, X_train_pvalues, y_train)
# Print the performance metrics for the new training data
print("Training performance:")
pvalues_performance_metrics_train = model_performance_classification_statsmodels(
lg3, X_train_pvalues, y_train
)
display(pvalues_performance_metrics_train)
# Align the testing data with the reduced training data features (after removing high p-value variables)
X_test_pvalues = X_test_multicoll[X_train_pvalues.columns]
# Predict on the testing data and evaluate the model performance
# Plot the confusion matrix for the testing data
confusion_matrix_statsmodels(lg3, X_test_pvalues, y_test)
# Print the performance metrics for the testing data
print("Testing performance:")
pvalues_performance_metrics_test = model_performance_classification_statsmodels(
lg3, X_test_pvalues, y_test
)
display(pvalues_performance_metrics_test)
True Negatives (TN): 16939 False Positives (FP): 2612 False Negatives (FN): 4472 True Positives (TP): 4997
Training performance:
| Accuracy | Recall | Precision | F1 | |
|---|---|---|---|---|
| 0 | 0.75589 | 0.52772 | 0.65672 | 0.58520 |
True Negatives (TN): 4206 False Positives (FP): 633 False Negatives (FN): 1154 True Positives (TP): 1262
Testing performance:
| Accuracy | Recall | Precision | F1 | |
|---|---|---|---|---|
| 0 | 0.75369 | 0.52235 | 0.66596 | 0.58548 |
X_train_final = X_train_pvalues
X_test_final = X_test_pvalues
# Building the logistic regression model
logit_final = logit3_pvalues
lg_final = logit_final.fit(disp=False)
print(lg_final.summary())
Logit Regression Results
==============================================================================
Dep. Variable: booking_status No. Observations: 29020
Model: Logit Df Residuals: 29008
Method: MLE Df Model: 11
Date: Sat, 03 Aug 2024 Pseudo R-squ.: 0.2094
Time: 01:50:44 Log-Likelihood: -14490.
converged: True LL-Null: -18327.
Covariance Type: nonrobust LLR p-value: 0.000
================================================================================================
coef std err z P>|z| [0.025 0.975]
------------------------------------------------------------------------------------------------
no_of_children 0.3098 0.038 8.253 0.000 0.236 0.383
no_of_weekend_nights 0.0457 0.016 2.774 0.006 0.013 0.078
no_of_week_nights -0.0525 0.010 -5.291 0.000 -0.072 -0.033
required_car_parking_space -1.2321 0.123 -9.983 0.000 -1.474 -0.990
lead_time 0.0115 0.000 59.742 0.000 0.011 0.012
arrival_month -0.1314 0.004 -33.002 0.000 -0.139 -0.124
arrival_date -0.0223 0.001 -15.458 0.000 -0.025 -0.019
repeated_guest -3.3823 0.408 -8.299 0.000 -4.181 -2.584
no_of_previous_cancellations 0.1992 0.065 3.052 0.002 0.071 0.327
no_of_special_requests -0.9907 0.023 -42.449 0.000 -1.036 -0.945
type_of_meal_plan_encoded 0.1495 0.014 10.831 0.000 0.122 0.177
room_type_reserved_encoded 0.1678 0.011 14.741 0.000 0.145 0.190
================================================================================================
Observations:
- Now that none of the features has a p-value > 0.05, we will consider the features in X_train_final as the final ones and lg_final as the final model.
Coefficient interpretations¶
- Coefficients of no_of_children, no_of_weekend_nights, lead_time, no_of_previous_cancellations, type_of_meal_plan_encoded, and room_type_reserved_encoded are positive; an increase in these will increase the chances of a booking being cancelled.
- Coefficients of no_of_week_nights, required_car_parking_space, arrival_month, arrival_date, repeated_guest, and no_of_special_requests are negative; an increase in these will decrease the chances of a booking being cancelled.
Converting coefficients to odds¶
- The coefficients ($\beta$'s) of the logistic regression model are in terms of $log(odds)$ and to find the odds, we have to take the exponential of the coefficients
- Therefore, $odds = exp(\beta)$
- The percentage change in odds is given as $(exp(\beta) - 1) * 100$
# converting coefficients to odds
odds = np.exp(lg_final.params)
# finding the percentage change
perc_change_odds = (np.exp(lg_final.params) - 1) * 100
# removing limit from number of columns to display
pd.set_option("display.max_columns", None)
# adding the odds to a dataframe
pd.DataFrame(
{"Odds": odds, "Change_odd%": perc_change_odds}, index=X_train_final.columns
).T
| no_of_children | no_of_weekend_nights | no_of_week_nights | required_car_parking_space | lead_time | arrival_month | arrival_date | repeated_guest | no_of_previous_cancellations | no_of_special_requests | type_of_meal_plan_encoded | room_type_reserved_encoded | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Odds | 1.36309 | 1.04680 | 0.94883 | 0.29167 | 1.01159 | 0.87684 | 0.97798 | 0.03397 | 1.22039 | 0.37133 | 1.16121 | 1.18269 |
| Change_odd% | 36.30887 | 4.67955 | -5.11741 | -70.83330 | 1.15924 | -12.31648 | -2.20164 | -96.60299 | 22.03925 | -62.86743 | 16.12083 | 18.26885 |
Coefficient interpretation observations and explanations:
Positive Influence on Cancellation:
no_of_children, no_of_weekend_nights, lead_time, no_of_previous_cancellations, type_of_meal_plan_encoded, and room_type_reserved_encoded all increase the odds of cancellation, meaning these factors make cancellations more likely as their values increase.
Negative Influence on Cancellation:
no_of_week_nights, required_car_parking_space, arrival_month, arrival_date, repeated_guest, and no_of_special_requests all decrease the odds of cancellation, meaning these factors make cancellations less likely as their values increase.
Key Takeaways:
- repeated_guest: This has a very strong negative impact on the likelihood of cancellation, indicating that repeat guests are far less likely to cancel their bookings. This suggests the importance of fostering guest loyalty.
- required_car_parking_space: The requirement of a parking space is associated with a significant decrease in cancellation risk, potentially indicating a more committed travel plan.
- no_of_children and no_of_weekend_nights: Both are associated with an increased risk of cancellation, suggesting that family bookings and weekend stays might require more flexible cancellation policies or additional customer support to reduce the risk.
Model performance evaluation¶
Checking model performance on training set¶
# Plot the confusion matrix for the training data
confusion_matrix_statsmodels(lg_final, X_train_final, y_train)
# Print the performance metrics for the new training data
print("Training performance:")
log_reg_model_train_perf = model_performance_classification_statsmodels(
lg_final, X_train_final, y_train
)
display(log_reg_model_train_perf)
True Negatives (TN): 16939 False Positives (FP): 2612 False Negatives (FN): 4472 True Positives (TP): 4997
Training performance:
| Accuracy | Recall | Precision | F1 | |
|---|---|---|---|---|
| 0 | 0.75589 | 0.52772 | 0.65672 | 0.58520 |
ROC-AUC on Training Set¶
# Calculate ROC-AUC for the training set
logit_roc_auc_train = roc_auc_score(y_train, lg_final.predict(X_train_final))
# Get the false positive rate and true positive rate
fpr, tpr, thresholds = roc_curve(y_train, lg_final.predict(X_train_final))
# Plot the ROC curve
plt.figure(figsize=(7, 5))
plt.plot(fpr, tpr, label="Logistic Regression (area = %0.2f)" % logit_roc_auc_train)
plt.plot([0, 1], [0, 1], "r--") # Diagonal line
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.01])
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.title("Receiver Operating Characteristic (ROC) Curve")
plt.legend(loc="lower right")
plt.show()
print("Train Set ROC-AUC:", logit_roc_auc_train)
Train Set ROC-AUC: 0.8008376066777733
Interpretation of the ROC Curve and AUC (0.80)
- ROC-AUC Score (0.80):
- The ROC-AUC score of 0.80 is quite strong, indicating that the model has a good ability to distinguish between cancellations and non-cancellations. Specifically, an AUC of 0.80 means that there is an 80% chance that the model will rank a randomly chosen canceled booking higher than a randomly chosen non-canceled booking.
- However, the confusion matrix and performance metrics suggest that while the model is good at ranking predictions, it struggles with recall (capturing actual cancellations).
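The ranking interpretation of AUC stated above can be verified directly: the fraction of positive-negative pairs in which the positive is scored higher (with ties counted as 0.5) equals `roc_auc_score`. The labels and scores below are made-up toy values for illustration, not the model's predictions.

```python
import numpy as np
from itertools import product
from sklearn.metrics import roc_auc_score

# Toy true labels and predicted probabilities (illustrative values only)
y_true = np.array([0, 0, 1, 1, 0, 1])
y_score = np.array([0.1, 0.4, 0.35, 0.8, 0.2, 0.7])

# Probability that a random positive outranks a random negative (ties count 0.5)
pos = y_score[y_true == 1]
neg = y_score[y_true == 0]
auc_manual = np.mean(
    [1.0 if p > n else 0.5 if p == n else 0.0 for p, n in product(pos, neg)]
)

print(auc_manual, roc_auc_score(y_true, y_score))  # the two values match
```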
Model Performance Improvement¶
- Since our aim is to increase F1, we will first look for the threshold where the difference between the True Positive Rate (TPR) and False Positive Rate (FPR) is maximized (Youden's J statistic), which balances the trade-off between catching cancellations and raising false alarms.
- Let's see if the recall score can be improved further, by changing the model threshold using AUC-ROC Curve.
# Calculate the false positive rate (fpr), true positive rate (tpr), and thresholds
fpr, tpr, thresholds = roc_curve(y_train, lg_final.predict(X_train_final))
# Find the optimal index where the difference between TPR and FPR is maximized
optimal_idx = np.argmax(tpr - fpr)
optimal_threshold_auc_roc = thresholds[optimal_idx]
print(f"Optimal Threshold: {optimal_threshold_auc_roc:.4f}")
Optimal Threshold: 0.3643
# Create the confusion matrix using the optimal threshold
confusion_matrix_statsmodels(
lg_final, X_train_final, y_train, threshold=optimal_threshold_auc_roc
)
True Negatives (TN): 14555 False Positives (FP): 4996 False Negatives (FN): 2686 True Positives (TP): 6783
# Check the model performance with the new threshold
log_reg_model_train_perf_threshold_auc_roc = (
model_performance_classification_statsmodels(
lg_final, X_train_final, y_train, threshold=optimal_threshold_auc_roc
)
)
print("Training performance with optimal threshold:")
display(log_reg_model_train_perf_threshold_auc_roc)
Training performance with optimal threshold:
| Accuracy | Recall | Precision | F1 | |
|---|---|---|---|---|
| 0 | 0.73529 | 0.71634 | 0.57586 | 0.63846 |
Let's use Precision-Recall curve and see if we can find a better threshold¶
# Predict the probabilities on the training data
y_scores = lg_final.predict(X_train_final)
# Generate the precision, recall, and threshold values
prec, rec, tre = precision_recall_curve(y_train, y_scores)
# Function to plot precision and recall against the thresholds
def plot_prec_recall_vs_tresh(precisions, recalls, thresholds):
plt.plot(thresholds, precisions[:-1], "b--", label="Precision")
plt.plot(thresholds, recalls[:-1], "g--", label="Recall")
plt.xlabel("Threshold")
plt.legend(loc="upper left")
plt.ylim([0, 1])
plt.title("Precision and Recall vs Threshold")
# Plot the Precision-Recall curve
plt.figure(figsize=(10, 7))
plot_prec_recall_vs_tresh(prec, rec, tre)
plt.show()
- At a threshold of around 0.62 precision and recall are equal, but stepping back to a value around 0.34 provides a higher recall while keeping reasonable precision.
- We can further confirm by using the following to find the threshold that gives the maximum F1 score.
# Calculate the F1 scores for each threshold
f1_scores = 2 * (prec * rec) / (prec + rec)
# Find the threshold that gives the maximum F1 score
optimal_idx = np.argmax(f1_scores)
optimal_threshold_curve = tre[optimal_idx]
print("Optimal Threshold based on Precision-Recall Curve:", optimal_threshold_curve)
Optimal Threshold based on Precision-Recall Curve: 0.3463375314218031
# Use the optimal threshold to create a confusion matrix
confusion_matrix_statsmodels(
lg_final, X_train_final, y_train, threshold=optimal_threshold_curve
)
# Check the model performance with the new threshold
log_reg_model_train_perf_threshold_curve = model_performance_classification_statsmodels(
lg_final, X_train_final, y_train, threshold=optimal_threshold_curve
)
print("Training performance with optimal threshold from Precision-Recall curve:")
display(log_reg_model_train_perf_threshold_curve)
True Negatives (TN): 14091 False Positives (FP): 5460 False Negatives (FN): 2465 True Positives (TP): 7004
Training performance with optimal threshold from Precision-Recall curve:
| Accuracy | Recall | Precision | F1 | |
|---|---|---|---|---|
| 0 | 0.72691 | 0.73968 | 0.56194 | 0.63867 |
Observations:
Increased Recall, Lower Precision: The model's recall has increased significantly, meaning it catches most cancellations, but this has caused a drop in precision, leading to more false positives. This is a common trade-off when you adjust the threshold to prioritize recall over precision.
F1 Score Improvement: The F1 score of 0.63867 reflects a balance that is better than using the default threshold (0.5). The F1 score improved by increasing the recall at the cost of precision. This is desirable if the cost of missing a cancellation (False Negative) is high.
Threshold Impact: By adjusting the threshold to the optimal value from the Precision-Recall curve, we have tailored the model's decision boundary to better suit the trade-off between precision and recall.
Checking model performance on test set¶
# Predict on the testing data and evaluate the model performance
# Plot the confusion matrix for the testing data
confusion_matrix_statsmodels(lg_final, X_test_final, y_test)
True Negatives (TN): 4206 False Positives (FP): 633 False Negatives (FN): 1154 True Positives (TP): 1262
# Checking the model performance on the test set
log_reg_model_test_perf = model_performance_classification_statsmodels(
lg_final, X_test_final, y_test
)
# Print the test set performance metrics
print("Test performance:")
display(log_reg_model_test_perf)
Test performance:
| Accuracy | Recall | Precision | F1 | |
|---|---|---|---|---|
| 0 | 0.75369 | 0.52235 | 0.66596 | 0.58548 |
Observations:
- Consistency Between Training and Test Sets: The model's performance on the test set is consistent with its performance on the training set, indicating that the model is generalizing well and not overfitting.
- Balanced Performance: The precision and recall values are balanced, suggesting the model is fairly reliable at predicting cancellations, though there is room for improvement, especially in recall (capturing more true cancellations).
- Potential for Threshold Adjustment: Since the model's recall is somewhat low, you may want to explore adjusting the threshold on the test set as you did on the training set to see if the F1 score can be further improved.
ROC-AUC on Test Set¶
# Calculate ROC-AUC for the test set
logit_roc_auc_test = roc_auc_score(y_test, lg_final.predict(X_test_final))
# Get the false positive rate and true positive rate for the test set
fpr_test, tpr_test, thresholds_test = roc_curve(y_test, lg_final.predict(X_test_final))
# Plot the ROC curve for the test set
plt.figure(figsize=(7, 5))
plt.plot(
fpr_test, tpr_test, label="Logistic Regression (area = %0.2f)" % logit_roc_auc_test
)
plt.plot([0, 1], [0, 1], "r--") # Diagonal line representing random guessing
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.01])
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.title("Receiver Operating Characteristic (ROC) Curve - Test Set")
plt.legend(loc="lower right")
plt.show()
print("Test Set ROC-AUC:", logit_roc_auc_test)
Test Set ROC-AUC: 0.8018190279996004
Option 1: Adjust Threshold Using the Optimal Threshold Found on the Training Set (0.3463, from the Precision-Recall Curve)¶
# Use the optimal threshold found in the training phase
confusion_matrix_statsmodels(
lg_final, X_test_final, y_test, threshold=optimal_threshold_curve
)
# Check the model performance on the test set with the new threshold
log_reg_model_test_perf_threshold = model_performance_classification_statsmodels(
lg_final, X_test_final, y_test, threshold=optimal_threshold_curve
)
print("Test performance with optimal threshold from training set:")
display(log_reg_model_test_perf_threshold)
True Negatives (TN): 3509 False Positives (FP): 1330 False Negatives (FN): 624 True Positives (TP): 1792
Test performance with optimal threshold from training set:
| Accuracy | Recall | Precision | F1 | |
|---|---|---|---|---|
| 0 | 0.73067 | 0.74172 | 0.57399 | 0.64717 |
Option 2: Determine a New Optimal Threshold for the Test Set¶
# Precision-Recall curve for test set
y_scores_test = lg_final.predict(X_test_final)
prec_test, rec_test, thresholds_test = precision_recall_curve(y_test, y_scores_test)
# Plot Precision-Recall curve
def plot_prec_recall_vs_tresh(precisions, recalls, thresholds):
plt.plot(thresholds, precisions[:-1], "b--", label="precision")
plt.plot(thresholds, recalls[:-1], "g--", label="recall")
plt.xlabel("Threshold")
plt.legend(loc="upper left")
plt.ylim([0, 1])
plt.figure(figsize=(10, 7))
plot_prec_recall_vs_tresh(prec_test, rec_test, thresholds_test)
plt.title("Precision-Recall Curve - Test Set")
plt.show()
# Get the predicted probabilities for the test set
y_scores_test = lg_final.predict(X_test_final)
# Calculate precision, recall, and thresholds
prec_test, rec_test, thresholds_test = precision_recall_curve(y_test, y_scores_test)
# Calculate F1 scores for each threshold
f1_scores_test = 2 * (prec_test * rec_test) / (prec_test + rec_test)
# Find the threshold that gives the maximum F1 score
optimal_idx_test = np.argmax(f1_scores_test)
optimal_threshold_test = thresholds_test[optimal_idx_test]
print(
"Optimal Threshold based on Precision-Recall Curve for Test Set:",
optimal_threshold_test,
)
Optimal Threshold based on Precision-Recall Curve for Test Set: 0.3576726321099713
# Use the new optimal threshold to create a confusion matrix for the test set
confusion_matrix_statsmodels(
lg_final, X_test_final, y_test, threshold=optimal_threshold_test
)
# Check the model performance on the test set with the new threshold
log_reg_model_test_perf_new_threshold = model_performance_classification_statsmodels(
lg_final, X_test_final, y_test, threshold=optimal_threshold_test
)
print(
"Test performance with optimal threshold from Precision-Recall curve on Test Set:"
)
display(log_reg_model_test_perf_new_threshold)
True Negatives (TN): 3602 False Positives (FP): 1237 False Negatives (FN): 658 True Positives (TP): 1758
Test performance with optimal threshold from Precision-Recall curve on Test Set:
| Accuracy | Recall | Precision | F1 | |
|---|---|---|---|---|
| 0 | 0.73880 | 0.72765 | 0.58698 | 0.64979 |
Observations:
Option 2 seems to provide a slightly better overall balance, with a marginally higher accuracy, precision, and F1 score, even though recall is slightly lower. The improvement in precision and F1 score indicates that Option 2 might be more effective at reducing false positives while still maintaining a good level of recall.
Based on the comparison, I will be using the optimal threshold determined directly from the test set (Option 2), as it offers a better overall performance, especially in terms of precision and F1 score.
Final Logistic Model Performance¶
Training performance comparison¶
Test performance comparison¶
# Define the threshold values being compared
default_threshold_test = 0.5  # The default threshold for logistic regression
train_pr_threshold = optimal_threshold_curve  # Optimal threshold from the training-set Precision-Recall curve
test_pr_threshold = optimal_threshold_test  # Optimal threshold from the test-set Precision-Recall curve
# Create a DataFrame comparing test performance metrics for different thresholds
models_test_comp_df = pd.concat(
    [
        log_reg_model_test_perf.T,  # Performance with the default threshold
        log_reg_model_test_perf_threshold.T,  # Performance with the training-set PR threshold
        log_reg_model_test_perf_new_threshold.T,  # Performance with the test-set PR threshold
    ],
    axis=1,
)
# Rename columns to reflect the threshold used
models_test_comp_df.columns = [
    f"Logistic Regression-default threshold ({default_threshold_test})",
    f"Logistic Regression-training-set PR threshold ({train_pr_threshold})",
    f"Logistic Regression-test-set PR threshold ({test_pr_threshold})",
]
# Display the comparison
print("Test performance comparison:")
display(models_test_comp_df)
Test performance comparison:
| Logistic Regression-default threshold (0.5) | Logistic Regression-training-set PR threshold (0.3463375314218031) | Logistic Regression-test-set PR threshold (0.3576726321099713) | |
|---|---|---|---|
| Accuracy | 0.75369 | 0.73067 | 0.73880 |
| Recall | 0.52235 | 0.74172 | 0.72765 |
| Precision | 0.66596 | 0.57399 | 0.58698 |
| F1 | 0.58548 | 0.64717 | 0.64979 |
Observations:
Threshold Selection:
- Default Threshold: The default threshold achieves the highest accuracy and precision; however, it does so at the expense of recall, meaning the model misses a significant number of true cancellations.
- Adjusted Thresholds: The adjusted thresholds, particularly the one optimized with the Precision-Recall curve on the test set, offer a more balanced trade-off between recall and precision. This balance is reflected in higher F1 scores, indicating better overall model performance.
Generalization:
Model Consistency: The model’s performance on the test set aligns closely with its performance on the training set, especially when using the adjusted thresholds. This consistency suggests that the model generalizes well to new, unseen data.
Optimal Threshold Selection: Based on the F1 score, which balances precision and recall, the Test Set Optimal Threshold (derived from the Precision-Recall curve) is recommended for deployment. This threshold strikes a good compromise by reducing the risk of missing cancellations (false negatives) while maintaining a manageable number of false positives.
Business Implication:
- Effective Cancellation Management: Deploying the model with the Test Set Optimal Threshold can help the hotel chain manage cancellations more effectively. By balancing the risks of over-preparing (false positives) and under-preparing (false negatives) for guest cancellations, the hotel can optimize its resources and maintain customer satisfaction.
6. Decision Tree Model Building¶
6.1 Data Preparation for modeling (Decision Tree)¶
- We want to predict which bookings will be canceled.
- Before we proceed to build a model, we'll have to encode categorical features.
- We'll split the data into train and test to be able to evaluate the model that we build on the train data.
# Separate features and target variable
X = dfmodeling.drop(["booking_status"], axis=1)
Y = dfmodeling["booking_status"]
# Encode categorical variables using pd.get_dummies()
X = pd.get_dummies(X, drop_first=True)
# Splitting data into train and test sets (70:30 split with random_state=1)
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.3, random_state=1)
# Output the shapes and class distributions
print("Shape of Training set : ", X_train.shape)
print("Shape of test set : ", X_test.shape)
print("\nPercentage of classes in training set:")
display(y_train.value_counts(normalize=True))
print("\nPercentage of classes in test set:")
display(y_test.value_counts(normalize=True))
Shape of Training set : (25392, 29) Shape of test set : (10883, 29) Percentage of classes in training set:
booking_status Not_Canceled 0.67064 Canceled 0.32936 Name: proportion, dtype: float64
Percentage of classes in test set:
booking_status Not_Canceled 0.67638 Canceled 0.32362 Name: proportion, dtype: float64
6.2 Decision Tree (default)¶
# Building the decision tree model
model0 = DecisionTreeClassifier(random_state=1)
model0.fit(X_train, y_train)
DecisionTreeClassifier(random_state=1)
6.3 Model Evaluation Criterion¶
Model can make wrong predictions as:
- False Negative (FN): Predicting that a booking will not be canceled when, in reality, it is canceled.
- False Positive (FP): Predicting that a booking will be canceled when, in reality, it is not.
Which Case is More Important?¶
False Negatives (FN):
- Revenue loss due to the inability to resell the room.
- Operational costs for last-minute cancellations, such as increased commission fees or price drops to sell the room quickly.
- Customer dissatisfaction if overbooking was done as a mitigation strategy and the hotel cannot accommodate all guests.
False Positives (FP):
- Underpreparing for the guest's arrival, leading to potential customer dissatisfaction.
- Incurring unnecessary operational adjustments, such as reallocating resources or staff.
How to Reduce the Losses?¶
Focus on F1 Score:
- The hotel should aim to maximize the F1 Score, a balanced metric that considers both precision (reducing false positives) and recall (reducing false negatives). By focusing on F1, the model aims to minimize the overall impact of both types of errors.
Balanced Approach:
- Given the importance of both types of errors (False Negatives leading to revenue loss and False Positives leading to operational inefficiencies), the F1 Score offers a balanced approach to optimize the model's performance.
Threshold Adjustment:
- Fine-tuning the decision threshold based on the F1 Score can help strike the right balance between precision and recall, minimizing the combined impact of false positives and false negatives.
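The threshold-tuning idea above can be applied to any sklearn classifier via `predict_proba`, mirroring the F1-driven search done earlier with statsmodels. The snippet below is a self-contained sketch on synthetic data; the `make_classification` settings (including the ~67:33 class weights echoing this dataset's balance) are illustrative assumptions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_recall_curve, f1_score

# Synthetic imbalanced data roughly mimicking a 67:33 class split (assumption)
X, y = make_classification(n_samples=2000, weights=[0.67, 0.33], random_state=1)
clf = LogisticRegression(max_iter=1000).fit(X, y)

# F1 at every candidate threshold from the Precision-Recall curve
proba = clf.predict_proba(X)[:, 1]
prec, rec, thr = precision_recall_curve(y, proba)
f1 = 2 * prec * rec / (prec + rec + 1e-12)  # guard against 0/0
best_thr = thr[np.argmax(f1[:-1])]  # thr has one fewer entry than prec/rec

f1_default = f1_score(y, proba >= 0.5)
f1_tuned = f1_score(y, proba >= best_thr)
print(f"default F1={f1_default:.3f}, tuned F1={f1_tuned:.3f} at threshold {best_thr:.3f}")
```

Because 0.5 is effectively among the candidate thresholds, the tuned F1 can never be worse than the default on the data used for the search; a validation set should be used in practice to avoid optimistic estimates.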
6.4 Streamlining Model Evaluation with Functions¶
- Creating Functions: To enhance efficiency and avoid repetitive code, let's create functions that will calculate various metrics and generate a confusion matrix for each model.
- model_performance_classification_sklearn function: This function will be used to evaluate the performance of our models, providing key metrics that help assess their effectiveness.
- confusion_matrix_sklearn function: This function will be used to plot the confusion matrix, allowing us to visualize the types of errors our models are making and identify areas for improvement.
# Convert y_train and y_test to numeric labels
y_train_numeric = y_train.map({"Canceled": 1, "Not_Canceled": 0})
y_test_numeric = y_test.map({"Canceled": 1, "Not_Canceled": 0})
# Function to compute different metrics to check performance of a classification model built using sklearn
def model_performance_classification_sklearn(model, predictors, target):
"""
Function to compute different metrics to check classification model performance
model: classifier
predictors: independent variables
target: dependent variable
"""
# Predict using the independent variables
pred = model.predict(predictors)
# Convert predictions to numeric format if they are not already
pred_numeric = pd.Series(pred).map({"Canceled": 1, "Not_Canceled": 0})
# Calculate performance metrics
acc = accuracy_score(target, pred_numeric) # Compute Accuracy
recall = recall_score(target, pred_numeric) # Compute Recall
precision = precision_score(target, pred_numeric) # Compute Precision
f1 = f1_score(target, pred_numeric) # Compute F1-score
# Creating a dataframe of metrics
df_perf = pd.DataFrame(
{
"Accuracy": acc,
"Recall": recall,
"Precision": precision,
"F1": f1,
},
index=[0],
)
return df_perf
# Function to plot the confusion_matrix of a classification model
def confusion_matrix_sklearn(model, predictors, target):
"""
To plot the confusion_matrix with percentages.
model: classifier
predictors: independent variables
target: dependent variable
"""
# Predict using the model
y_pred = model.predict(predictors)
# Convert predictions to numeric format
y_pred_numeric = pd.Series(y_pred).map({"Canceled": 1, "Not_Canceled": 0})
# Generate the confusion matrix
cm = confusion_matrix(target, y_pred_numeric)
labels = np.asarray(
[
["{0:0.0f}".format(item) + "\n{0:.2%}".format(item / cm.flatten().sum())]
for item in cm.flatten()
]
).reshape(2, 2)
# Plot the confusion matrix
plt.figure(figsize=(6, 4))
sns.heatmap(cm, annot=labels, fmt="")
plt.ylabel("True label")
plt.xlabel("Predicted label")
plt.show()
Checking model performance on training set¶
# Create the confusion matrix for the training data using the numeric labels
confusion_matrix_sklearn(model0, X_train, y_train_numeric)
# Evaluate the model performance on the training data with numeric labels
decision_tree_perf_train = model_performance_classification_sklearn(
model0, X_train, y_train_numeric
)
# Display the training performance metrics
print("Decision Tree Model Performance on Training Set:")
display(decision_tree_perf_train)
Decision Tree Model Performance on Training Set:
| | Accuracy | Recall | Precision | F1 |
|---|---|---|---|---|
| 0 | 0.99421 | 0.99103 | 0.99139 | 0.99121 |
Checking model performance on test set¶
# Create the confusion matrix for the test data using the numeric labels
confusion_matrix_sklearn(model0, X_test, y_test_numeric)
# Evaluate the model performance on the test data with numeric labels
decision_tree_perf_test = model_performance_classification_sklearn(
model0, X_test, y_test_numeric
)
# Display the test performance metrics
print("Decision Tree Model Performance on Test Set:")
display(decision_tree_perf_test)
Decision Tree Model Performance on Test Set:
| | Accuracy | Recall | Precision | F1 |
|---|---|---|---|---|
| 0 | 0.86971 | 0.80863 | 0.79287 | 0.80067 |
Observations:
- Overfitting: The model performs exceptionally well on the training set, with near-perfect metrics, indicating overfitting: the model is capturing noise in the training data as if it were signal, which hurts performance on unseen data.
- Generalization to the test set: While test performance is still strong, the drop in metrics (especially accuracy and precision) relative to the training set confirms the model does not generalize as well as it could. Even so, a test F1 score of ~80.07% indicates an effective balance between precision and recall.
Do we need to prune the tree?¶
Yes. To combat overfitting, we will be pruning the decision tree to simplify the model and improve its generalization ability.
Assessing Important Features
# Get feature names from the training set
feature_names = list(X_train.columns)
# Get feature importances from the model
importances = model0.feature_importances_
# Sort the feature importances in ascending order
indices = np.argsort(importances)
# Plot the feature importances
plt.figure(figsize=(8, 8))
plt.title("Feature Importances")
plt.barh(range(len(indices)), importances[indices], color="violet", align="center")
plt.yticks(range(len(indices)), [feature_names[i] for i in indices])
plt.xlabel("Relative Importance")
plt.show()
Observations:
- Lead Time: The most significant feature, contributing the highest relative importance. The number of days between booking and arrival strongly influences whether a booking is canceled.
- Average Price per Room: The second most important feature; the room's average price also plays a crucial role in predicting cancellations, as higher or lower prices may correlate with different cancellation behaviors.
- Market Segment Type - Online: A significant impact, suggesting that bookings made through online channels are particularly important in predicting cancellations.
- Arrival Date and Month: Considerable importance, indicating that the specific timing of the booking influences the likelihood of cancellation.
- Any Special Requests: Guests who make special requests also have a noticeable impact on the model, perhaps reflecting specific customer behaviors related to cancellations.
- Other Features: The number of weekend nights, weeknights, total nights, and specific room types have lower but still relevant importance.
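As a quick complement to the bar chart, the same importances can be read as a ranked table. The sketch below is self-contained and uses synthetic data with placeholder feature names, not the notebook's fitted `model0`:

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data; these feature names are hypothetical placeholders
X, y = make_classification(n_samples=500, n_features=5, n_informative=3, random_state=1)
features = ["lead_time", "avg_price_per_room", "arrival_month", "no_of_adults", "no_of_week_nights"]
clf = DecisionTreeClassifier(max_depth=4, random_state=1).fit(X, y)

# Importances as a ranked table rather than a bar chart; they sum to 1
ranked = pd.Series(clf.feature_importances_, index=features).sort_values(ascending=False)
print(ranked)
```

On the notebook's own model, the same two lines applied to `model0` and `X_train.columns` would reproduce the ranking shown in the plot.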
6.5 Pruning the Tree¶
Pre-pruning the Tree¶
Using GridSearchCV for hyperparameter tuning of our tree model
- Hyperparameter tuning is tricky because there is no direct way to calculate how a change in a hyperparameter value will affect the model's loss, so we usually resort to experimentation; here, we use grid search.
- Grid search is a tuning technique that attempts to find the optimal values of hyperparameters.
- It is an exhaustive search performed over the specified parameter values of a model.
- The parameters of the estimator are optimized by cross-validated grid search over a parameter grid.
# Choose the type of classifier.
estimator = DecisionTreeClassifier(random_state=1, class_weight="balanced")
# Grid of parameters to choose from
parameters = {
"max_depth": np.arange(2, 7, 2), # Depth of the tree
"max_leaf_nodes": [50, 75, 150, 250], # Maximum number of leaf nodes
"min_samples_split": [
10,
30,
50,
70,
], # Minimum number of samples required to split an internal node
}
# Type of scoring used to compare parameter combinations
f1_scorer = make_scorer(
    f1_score, pos_label="Canceled"
)  # F1 score focusing on the "Canceled" class
# Run the grid search
grid_obj = GridSearchCV(estimator, parameters, scoring=f1_scorer, cv=5)
grid_obj = grid_obj.fit(X_train, y_train)
# Set the clf to the best combination of parameters
best_estimator = grid_obj.best_estimator_
# Fit the best algorithm to the data
best_estimator.fit(X_train, y_train)
DecisionTreeClassifier(class_weight='balanced', max_depth=6, max_leaf_nodes=50,
                       min_samples_split=10, random_state=1)
Checking performance on training set¶
# Checking performance on the training set
confusion_matrix_sklearn(best_estimator, X_train, y_train_numeric)
# Evaluating the model performance on the training data
decision_tree_pruned_perf_train = model_performance_classification_sklearn(
best_estimator, X_train, y_train_numeric
)
# Displaying the training performance metrics
print("Pruned Decision Tree Model Performance on Training Set:")
display(decision_tree_pruned_perf_train)
Pruned Decision Tree Model Performance on Training Set:
| | Accuracy | Recall | Precision | F1 |
|---|---|---|---|---|
| 0 | 0.83495 | 0.78022 | 0.73496 | 0.75692 |
Checking performance on test set¶
# Checking performance on the test set
confusion_matrix_sklearn(best_estimator, X_test, y_test_numeric)
# Evaluating the model performance on the test data
decision_tree_pruned_perf_test = model_performance_classification_sklearn(
best_estimator, X_test, y_test_numeric
)
# Displaying the test performance metrics
print("Pruned Decision Tree Model Performance on Test Set:")
display(decision_tree_pruned_perf_test)
Pruned Decision Tree Model Performance on Test Set:
| | Accuracy | Recall | Precision | F1 |
|---|---|---|---|---|
| 0 | 0.83736 | 0.77683 | 0.73548 | 0.75559 |
Observations:
- Consistency: The pruned decision tree shows very similar performance on both the training and test sets, indicating good generalization and avoiding overfitting. The accuracy, recall, precision, and F1 scores are all closely aligned across both datasets.
- Balanced Trade-off: The F1 score is quite balanced, with values close to 76% on both sets. This suggests that the pruned decision tree maintains a good balance between precision and recall, which is important when F1 is your evaluation metric.
- Improvement over Initial Model: The pruning process has likely resulted in a simpler model with better generalization abilities. Compared to the unpruned model, the pruned model avoids overly complex decision rules that might not generalize well to unseen data.
6.6 Model Visualization¶
# Visualizing the pre-pruned decision tree
plt.figure(figsize=(20, 10))
out = tree.plot_tree(
best_estimator,
feature_names=X_train.columns,
filled=True,
fontsize=9,
node_ids=False,
class_names=None,
)
# Add arrows to the decision tree split if they are missing
for o in out:
arrow = o.arrow_patch
if arrow is not None:
arrow.set_edgecolor("black")
arrow.set_linewidth(1)
plt.show()
# Text report showing the rules of the decision tree
print(
tree.export_text(
best_estimator, feature_names=list(X_train.columns), show_weights=True
)
)
|--- lead_time <= 151.50
|   |--- any_special_requests <= 0.50
|   |   |--- market_segment_type_Online <= 0.50
|   |   |   |--- lead_time <= 90.50
|   |   |   |   |--- total_nights <= 5.50
|   |   |   |   |   |--- avg_price_per_room <= 201.50
|   |   |   |   |   |   |--- weights: [419.00, 2739.90] class: Not_Canceled
|   |   |   |   |   |--- avg_price_per_room > 201.50
|   |   |   |   |   |   |--- weights: [25.81, 1.49] class: Canceled
|   |   |   |   |--- total_nights > 5.50
|   |   |   |   |   |--- avg_price_per_room <= 92.80
|   |   |   |   |   |   |--- weights: [12.14, 68.59] class: Not_Canceled
|   |   |   |   |   |--- avg_price_per_room > 92.80
|   |   |   |   |   |   |--- weights: [85.01, 17.15] class: Canceled
|   |   |   |--- lead_time > 90.50
|   |   |   |   |--- lead_time <= 117.50
|   |   |   |   |   |--- avg_price_per_room <= 93.58
|   |   |   |   |   |   |--- weights: [227.72, 214.72] class: Canceled
|   |   |   |   |   |--- avg_price_per_room > 93.58
|   |   |   |   |   |   |--- weights: [285.41, 82.76] class: Canceled
|   |   |   |   |--- lead_time > 117.50
|   |   |   |   |   |--- no_of_week_nights <= 1.50
|   |   |   |   |   |   |--- weights: [81.98, 87.23] class: Not_Canceled
|   |   |   |   |   |--- no_of_week_nights > 1.50
|   |   |   |   |   |   |--- weights: [48.58, 228.14] class: Not_Canceled
|   |   |--- market_segment_type_Online > 0.50
|   |   |   |--- lead_time <= 13.50
|   |   |   |   |--- avg_price_per_room <= 99.44
|   |   |   |   |   |--- arrival_month <= 1.50
|   |   |   |   |   |   |--- weights: [0.00, 92.45] class: Not_Canceled
|   |   |   |   |   |--- arrival_month > 1.50
|   |   |   |   |   |   |--- weights: [132.08, 363.83] class: Not_Canceled
|   |   |   |   |--- avg_price_per_room > 99.44
|   |   |   |   |   |--- lead_time <= 3.50
|   |   |   |   |   |   |--- weights: [85.01, 219.94] class: Not_Canceled
|   |   |   |   |   |--- lead_time > 3.50
|   |   |   |   |   |   |--- weights: [280.85, 132.71] class: Canceled
|   |   |   |--- lead_time > 13.50
|   |   |   |   |--- required_car_parking_space <= 0.50
|   |   |   |   |   |--- avg_price_per_room <= 71.92
|   |   |   |   |   |   |--- weights: [159.40, 158.80] class: Canceled
|   |   |   |   |   |--- avg_price_per_room > 71.92
|   |   |   |   |   |   |--- weights: [3543.28, 850.67] class: Canceled
|   |   |   |   |--- required_car_parking_space > 0.50
|   |   |   |   |   |--- weights: [1.52, 48.46] class: Not_Canceled
|   |--- any_special_requests > 0.50
|   |   |--- no_of_special_requests <= 1.50
|   |   |   |--- market_segment_type_Online <= 0.50
|   |   |   |   |--- lead_time <= 102.50
|   |   |   |   |   |--- type_of_meal_plan_Not Selected <= 0.50
|   |   |   |   |   |   |--- weights: [9.11, 697.09] class: Not_Canceled
|   |   |   |   |   |--- type_of_meal_plan_Not Selected > 0.50
|   |   |   |   |   |   |--- weights: [9.11, 15.66] class: Not_Canceled
|   |   |   |   |--- lead_time > 102.50
|   |   |   |   |   |--- no_of_week_nights <= 2.50
|   |   |   |   |   |   |--- weights: [19.74, 32.06] class: Not_Canceled
|   |   |   |   |   |--- no_of_week_nights > 2.50
|   |   |   |   |   |   |--- weights: [3.04, 44.73] class: Not_Canceled
|   |   |   |--- market_segment_type_Online > 0.50
|   |   |   |   |--- lead_time <= 8.50
|   |   |   |   |   |--- lead_time <= 4.50
|   |   |   |   |   |   |--- weights: [44.03, 498.03] class: Not_Canceled
|   |   |   |   |   |--- lead_time > 4.50
|   |   |   |   |   |   |--- weights: [63.76, 258.71] class: Not_Canceled
|   |   |   |   |--- lead_time > 8.50
|   |   |   |   |   |--- required_car_parking_space <= 0.50
|   |   |   |   |   |   |--- weights: [1451.32, 2512.51] class: Not_Canceled
|   |   |   |   |   |--- required_car_parking_space > 0.50
|   |   |   |   |   |   |--- weights: [1.52, 134.20] class: Not_Canceled
|   |   |--- no_of_special_requests > 1.50
|   |   |   |--- lead_time <= 90.50
|   |   |   |   |--- no_of_week_nights <= 3.50
|   |   |   |   |   |--- weights: [0.00, 1585.04] class: Not_Canceled
|   |   |   |   |--- no_of_week_nights > 3.50
|   |   |   |   |   |--- no_of_special_requests <= 2.50
|   |   |   |   |   |   |--- weights: [57.69, 180.42] class: Not_Canceled
|   |   |   |   |   |--- no_of_special_requests > 2.50
|   |   |   |   |   |   |--- weights: [0.00, 52.19] class: Not_Canceled
|   |   |   |--- lead_time > 90.50
|   |   |   |   |--- no_of_special_requests <= 2.50
|   |   |   |   |   |--- arrival_month <= 8.50
|   |   |   |   |   |   |--- weights: [56.17, 184.90] class: Not_Canceled
|   |   |   |   |   |--- arrival_month > 8.50
|   |   |   |   |   |   |--- weights: [106.27, 106.61] class: Not_Canceled
|   |   |   |   |--- no_of_special_requests > 2.50
|   |   |   |   |   |--- weights: [0.00, 67.10] class: Not_Canceled
|--- lead_time > 151.50
|   |--- avg_price_per_room <= 100.04
|   |   |--- any_special_requests <= 0.50
|   |   |   |--- no_of_adults <= 1.50
|   |   |   |   |--- market_segment_type_Online <= 0.50
|   |   |   |   |   |--- lead_time <= 163.50
|   |   |   |   |   |   |--- weights: [24.29, 3.73] class: Canceled
|   |   |   |   |   |--- lead_time > 163.50
|   |   |   |   |   |   |--- weights: [62.24, 257.96] class: Not_Canceled
|   |   |   |   |--- market_segment_type_Online > 0.50
|   |   |   |   |   |--- avg_price_per_room <= 2.50
|   |   |   |   |   |   |--- weights: [3.04, 8.95] class: Not_Canceled
|   |   |   |   |   |--- avg_price_per_room > 2.50
|   |   |   |   |   |   |--- weights: [97.16, 0.75] class: Canceled
|   |   |   |--- no_of_adults > 1.50
|   |   |   |   |--- avg_price_per_room <= 82.47
|   |   |   |   |   |--- market_segment_type_Offline <= 0.50
|   |   |   |   |   |   |--- weights: [282.37, 2.98] class: Canceled
|   |   |   |   |   |--- market_segment_type_Offline > 0.50
|   |   |   |   |   |   |--- weights: [385.60, 213.97] class: Canceled
|   |   |   |   |--- avg_price_per_room > 82.47
|   |   |   |   |   |--- no_of_adults <= 2.50
|   |   |   |   |   |   |--- weights: [1030.80, 23.86] class: Canceled
|   |   |   |   |   |--- no_of_adults > 2.50
|   |   |   |   |   |   |--- weights: [0.00, 5.22] class: Not_Canceled
|   |   |--- any_special_requests > 0.50
|   |   |   |--- no_of_weekend_nights <= 0.50
|   |   |   |   |--- lead_time <= 180.50
|   |   |   |   |   |--- lead_time <= 159.50
|   |   |   |   |   |   |--- weights: [7.59, 7.46] class: Canceled
|   |   |   |   |   |--- lead_time > 159.50
|   |   |   |   |   |   |--- weights: [4.55, 37.28] class: Not_Canceled
|   |   |   |   |--- lead_time > 180.50
|   |   |   |   |   |--- no_of_special_requests <= 2.50
|   |   |   |   |   |   |--- weights: [212.54, 20.13] class: Canceled
|   |   |   |   |   |--- no_of_special_requests > 2.50
|   |   |   |   |   |   |--- weights: [0.00, 8.95] class: Not_Canceled
|   |   |   |--- no_of_weekend_nights > 0.50
|   |   |   |   |--- market_segment_type_Offline <= 0.50
|   |   |   |   |   |--- arrival_month <= 11.50
|   |   |   |   |   |   |--- weights: [110.82, 231.12] class: Not_Canceled
|   |   |   |   |   |--- arrival_month > 11.50
|   |   |   |   |   |   |--- weights: [34.92, 19.38] class: Canceled
|   |   |   |   |--- market_segment_type_Offline > 0.50
|   |   |   |   |   |--- lead_time <= 348.50
|   |   |   |   |   |   |--- weights: [3.04, 106.61] class: Not_Canceled
|   |   |   |   |   |--- lead_time > 348.50
|   |   |   |   |   |   |--- weights: [4.55, 5.96] class: Not_Canceled
|   |--- avg_price_per_room > 100.04
|   |   |--- arrival_month <= 11.50
|   |   |   |--- no_of_special_requests <= 2.50
|   |   |   |   |--- weights: [3200.19, 0.00] class: Canceled
|   |   |   |--- no_of_special_requests > 2.50
|   |   |   |   |--- weights: [0.00, 23.11] class: Not_Canceled
|   |   |--- arrival_month > 11.50
|   |   |   |--- any_special_requests <= 0.50
|   |   |   |   |--- weights: [0.00, 35.04] class: Not_Canceled
|   |   |   |--- any_special_requests > 0.50
|   |   |   |   |--- arrival_date <= 24.50
|   |   |   |   |   |--- weights: [0.00, 3.73] class: Not_Canceled
|   |   |   |   |--- arrival_date > 24.50
|   |   |   |   |   |--- weights: [22.77, 3.73] class: Canceled
Observations:
- The tree structure shows the importance of the lead_time, avg_price_per_room, and any_special_requests features, among others. These features are repeatedly used to make splits, indicating their significant role in predicting booking cancellations.
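The claim that certain features are used repeatedly in splits can also be checked programmatically via the fitted tree's `tree_.feature` array. A self-contained sketch on synthetic data (the feature names are placeholders, not the notebook's model):

```python
from collections import Counter

from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data; these feature names are hypothetical placeholders
X, y = make_classification(n_samples=500, n_features=4, n_informative=3, n_redundant=1, random_state=1)
features = ["lead_time", "avg_price_per_room", "any_special_requests", "arrival_month"]
clf = DecisionTreeClassifier(max_depth=5, random_state=1).fit(X, y)

# tree_.feature holds the feature index tested at each internal node; leaves are marked -2
split_features = clf.tree_.feature[clf.tree_.feature >= 0]
split_counts = Counter(features[i] for i in split_features)
print(split_counts.most_common())
```

Applied to `best_estimator` and `X_train.columns`, the same counting would quantify how often each feature appears in the pruned tree's splits.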
# Display the importances
importances = best_estimator.feature_importances_
print(importances)
[0.02702997 0. 0.01243347 0.00702817 0.01415876 0.4676694 0. 0.01418297 0.00076349 0. 0. 0. 0.08210707 0.03096242 0.00839254 0.13891175 0. 0. 0.00095344 0. 0. 0. 0. 0. 0. 0. 0. 0.01004782 0.18535873]
# Get the importance of each feature in the decision tree
importances = best_estimator.feature_importances_
indices = np.argsort(importances)
# Plotting the feature importances
plt.figure(figsize=(8, 8))
plt.title("Feature Importances")
plt.barh(range(len(indices)), importances[indices], color="violet", align="center")
plt.yticks(range(len(indices)), [feature_names[i] for i in indices])
plt.xlabel("Relative Importance")
plt.show()
- The pruned tree relies on a smaller set of influential features, with lead_time and the special-requests indicators dominating the importances.
6.7 Cost Complexity Pruning¶
- The DecisionTreeClassifier provides parameters such as min_samples_leaf and max_depth to prevent a tree from overfitting. Cost complexity pruning provides another option to control the size of a tree. In DecisionTreeClassifier, this pruning technique is parameterized by the cost complexity parameter, ccp_alpha. Greater values of ccp_alpha increase the number of nodes pruned. Here we only show the effect of ccp_alpha on regularizing the trees and how to choose a ccp_alpha based on validation scores.
Total impurity of leaves vs effective alphas of pruned tree
- Minimal cost complexity pruning recursively finds the node with the "weakest link". The weakest link is characterized by an effective alpha, where the nodes with the smallest effective alpha are pruned first. To get an idea of what values of ccp_alpha could be appropriate, scikit-learn provides DecisionTreeClassifier.cost_complexity_pruning_path, which returns the effective alphas and the corresponding total leaf impurities at each step of the pruning process. As alpha increases, more of the tree is pruned, which increases the total impurity of its leaves.
# Initialize the classifier (ensure consistency with your previous settings)
clf = DecisionTreeClassifier(random_state=1, class_weight="balanced")
# Get the cost complexity pruning path
path = clf.cost_complexity_pruning_path(X_train, y_train_numeric)
ccp_alphas, impurities = abs(path.ccp_alphas), path.impurities
# Display the pruning path data
pruning_path_df = pd.DataFrame({"Alpha": ccp_alphas, "Total Impurity": impurities})
display(pruning_path_df)
| | Alpha | Total Impurity |
|---|---|---|
| 0 | 0.00000 | 0.00838 |
| 1 | 0.00000 | 0.00838 |
| 2 | 0.00000 | 0.00838 |
| 3 | 0.00000 | 0.00838 |
| 4 | 0.00000 | 0.00838 |
| ... | ... | ... |
| 1841 | 0.00890 | 0.32806 |
| 1842 | 0.00980 | 0.33786 |
| 1843 | 0.01272 | 0.35058 |
| 1844 | 0.03412 | 0.41882 |
| 1845 | 0.08118 | 0.50000 |
1846 rows × 2 columns
# Plot the total impurity vs effective alpha
fig, ax = plt.subplots(figsize=(10, 5))
ax.plot(ccp_alphas[:-1], impurities[:-1], marker="o", drawstyle="steps-post")
ax.set_xlabel("Effective Alpha")
ax.set_ylabel("Total Impurity of Leaves")
ax.set_title("Total Impurity vs Effective Alpha for Training Set")
plt.show()
Observations:
- Left Side (Low Alpha): The tree is complex, with many leaves. The total impurity is low, but the model might be overfitting.
- Right Side (High Alpha): The tree is pruned more aggressively. The total impurity increases, indicating a simpler tree with fewer leaves, which may lead to underfitting.
Next, we train a decision tree using effective alphas. The last value in ccp_alphas is the alpha value that prunes the whole tree, leaving the tree, clfs[-1], with one node.
# Initialize list to store decision trees trained with different alpha values
clfs = []
for ccp_alpha in ccp_alphas:
clf = DecisionTreeClassifier(
random_state=1, ccp_alpha=ccp_alpha, class_weight="balanced"
)
clf.fit(X_train, y_train_numeric)
clfs.append(clf)
# Print out the number of nodes in the last tree
print(
"Number of nodes in the last tree is: {} with ccp_alpha: {}".format(
clfs[-1].tree_.node_count, ccp_alphas[-1]
)
)
Number of nodes in the last tree is: 1 with ccp_alpha: 0.08117914389137282
For the remainder, we remove the last element in clfs and ccp_alphas, because it is the trivial tree with only one node. Here we show that the number of nodes and tree depth decreases as alpha increases.
# Remove the trivial tree with only one node
clfs = clfs[:-1]
ccp_alphas = ccp_alphas[:-1]
# Calculate the number of nodes and the depth for each tree
node_counts = [clf.tree_.node_count for clf in clfs]
depth = [clf.tree_.max_depth for clf in clfs]
# Plot the results
fig, ax = plt.subplots(2, 1, figsize=(10, 7))
ax[0].plot(ccp_alphas, node_counts, marker="o", drawstyle="steps-post")
ax[0].set_xlabel("alpha")
ax[0].set_ylabel("number of nodes")
ax[0].set_title("Number of nodes vs alpha")
ax[1].plot(ccp_alphas, depth, marker="o", drawstyle="steps-post")
ax[1].set_xlabel("alpha")
ax[1].set_ylabel("depth of tree")
ax[1].set_title("Depth vs alpha")
fig.tight_layout()
plt.show()
Number of Nodes vs Alpha
- As the value of alpha increases, the number of nodes in the tree decreases significantly, which is expected because higher alpha values lead to more aggressive pruning.
- The number of nodes stabilizes after a certain point (around alpha > 0.004), meaning that further increasing alpha doesn't significantly reduce the number of nodes.
Depth vs Alpha
- Similarly, the depth of the tree decreases rapidly as alpha increases, stabilizing after alpha > 0.004. This stabilization suggests that beyond a certain alpha, the tree doesn't become much simpler in terms of depth.
6.8 F1 Score vs alpha for training and testing sets¶
# Calculate F1 Scores for training and testing sets
f1_train = []
f1_test = []
for clf in clfs:
pred_train = clf.predict(X_train)
f1_train.append(f1_score(y_train_numeric, pred_train))
pred_test = clf.predict(X_test)
f1_test.append(f1_score(y_test_numeric, pred_test))
# Plot F1 Score vs alpha for training and testing sets
fig, ax = plt.subplots(figsize=(15, 5))
ax.set_xlabel("alpha")
ax.set_ylabel("F1 Score")
ax.set_title("F1 Score vs alpha for training and testing sets")
ax.plot(ccp_alphas, f1_train, marker="o", label="train", drawstyle="steps-post")
ax.plot(ccp_alphas, f1_test, marker="o", label="test", drawstyle="steps-post")
ax.legend()
plt.show()
# Find the best alpha (the one that gives the highest F1 score on the test set)
index_best_model = np.argmax(f1_test)
best_model = clfs[index_best_model]
print("Best alpha:", ccp_alphas[index_best_model])
print("Best Decision Tree Model:", best_model)
Best alpha: 0.00010671446444450493
Best Decision Tree Model: DecisionTreeClassifier(ccp_alpha=0.00010671446444450493,
class_weight='balanced', random_state=1)
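A caveat on this selection: choosing alpha by maximizing test-set F1 lets the test data influence model selection. A leakage-free alternative is to tune ccp_alpha by cross-validation on the training set only. A minimal, self-contained sketch on synthetic data (GridSearchCV, `cost_complexity_pruning_path`, and `scoring="f1"` are standard scikit-learn; the data here is not the hotel dataset):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data
X, y = make_classification(n_samples=600, random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=1)

# Candidate alphas come from the pruning path; the winner is chosen by
# cross-validated F1 on the training folds only, never on the test set
path = DecisionTreeClassifier(random_state=1).cost_complexity_pruning_path(X_tr, y_tr)
alphas = np.unique(path.ccp_alphas[path.ccp_alphas >= 0])
grid = GridSearchCV(
    DecisionTreeClassifier(random_state=1),
    {"ccp_alpha": alphas},
    scoring="f1",
    cv=5,
).fit(X_tr, y_tr)
print("CV-selected alpha:", grid.best_params_["ccp_alpha"])
```

The test set is then used once, at the end, to report the selected model's performance.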
6.9 Checking performance on Training and Test set¶
# Refit an unpruned tree (no ccp_alpha) to serve as the unpruned baseline
best_model_unpruned = DecisionTreeClassifier(random_state=1, class_weight="balanced")
best_model_unpruned.fit(X_train, y_train)
# Check performance on the training set
confusion_matrix_sklearn(best_model_unpruned, X_train, y_train_numeric)
decision_tree_perf_train_unpruned = model_performance_classification_sklearn(
best_model_unpruned, X_train, y_train_numeric
)
print("Decision Tree Model Performance on Training Set (Unpruned):")
display(decision_tree_perf_train_unpruned)
# Check performance on the test set
confusion_matrix_sklearn(best_model_unpruned, X_test, y_test_numeric)
decision_tree_perf_test_unpruned = model_performance_classification_sklearn(
best_model_unpruned, X_test, y_test_numeric
)
print("Decision Tree Model Performance on Test Set (Unpruned):")
display(decision_tree_perf_test_unpruned)
Decision Tree Model Performance on Training Set (Unpruned):
| | Accuracy | Recall | Precision | F1 |
|---|---|---|---|---|
| 0 | 0.99311 | 0.99510 | 0.98415 | 0.98960 |
Decision Tree Model Performance on Test Set (Unpruned):
| | Accuracy | Recall | Precision | F1 |
|---|---|---|---|---|
| 0 | 0.86529 | 0.81062 | 0.78134 | 0.79571 |
# Set a fixed alpha value for pruning (choose a reasonable small value)
fixed_alpha = 0.001 # You can adjust this value based on your data
# Refit the decision tree with this fixed alpha
best_model_fixed_alpha = DecisionTreeClassifier(
random_state=1, ccp_alpha=fixed_alpha, class_weight="balanced"
)
best_model_fixed_alpha.fit(X_train, y_train)
# Check performance on the training set
confusion_matrix_sklearn(best_model_fixed_alpha, X_train, y_train_numeric)
decision_tree_perf_train_fixed_alpha = model_performance_classification_sklearn(
best_model_fixed_alpha, X_train, y_train_numeric
)
print("Decision Tree Model Performance on Training Set (Fixed Alpha):")
print(decision_tree_perf_train_fixed_alpha)
# Check performance on the test set
confusion_matrix_sklearn(best_model_fixed_alpha, X_test, y_test_numeric)
decision_tree_perf_test_fixed_alpha = model_performance_classification_sklearn(
best_model_fixed_alpha, X_test, y_test_numeric
)
print("Decision Tree Model Performance on Test Set (Fixed Alpha):")
print(decision_tree_perf_test_fixed_alpha)
Decision Tree Model Performance on Training Set (Fixed Alpha):
   Accuracy   Recall  Precision       F1
0   0.85586  0.77998    0.78185  0.78092
Decision Tree Model Performance on Test Set (Fixed Alpha):
   Accuracy   Recall  Precision       F1
0   0.85271  0.77200    0.77266  0.77233
# Verify that there are no NaNs in the feature sets
print("NaNs in X_train:", X_train.isnull().sum().sum())
print("NaNs in X_test:", X_test.isnull().sum().sum())
# Verify the integrity of y_train and y_test
print("Unique values in y_train:", y_train.unique())
print("Unique values in y_test:", y_test.unique())
NaNs in X_train: 0
NaNs in X_test: 0
Unique values in y_train: ['Canceled' 'Not_Canceled']
Unique values in y_test: ['Not_Canceled' 'Canceled']
Model Performance Comparison and Conclusions¶
# Compare the performance of models
model_comparison_df = pd.concat(
    [
        decision_tree_perf_train_unpruned.T,  # Unpruned Decision Tree
        decision_tree_perf_train_fixed_alpha.T,  # Post-pruned Decision Tree (ccp_alpha=0.001)
        log_reg_model_train_perf.T,  # Logistic Regression (from earlier)
    ],
    axis=1,
)
model_comparison_df.columns = [
    "Unpruned Decision Tree",
    "Post-pruned Decision Tree",
    "Logistic Regression",
]
print("Model Performance Comparison:")
display(model_comparison_df)
Model Performance Comparison:
| | Unpruned Decision Tree | Post-pruned Decision Tree | Logistic Regression |
|---|---|---|---|
| Accuracy | 0.99311 | 0.85586 | 0.75589 |
| Recall | 0.99510 | 0.77998 | 0.52772 |
| Precision | 0.98415 | 0.78185 | 0.65672 |
| F1 | 0.98960 | 0.78092 | 0.58520 |
plt.figure(figsize=(20, 10))
out = tree.plot_tree(
best_model_fixed_alpha,
feature_names=feature_names,
filled=True,
fontsize=9,
node_ids=False,
class_names=None,
)
for o in out:
arrow = o.arrow_patch
if arrow is not None:
arrow.set_edgecolor("black")
arrow.set_linewidth(1)
plt.show()
# Print the rules of the pruned decision tree
print(
tree.export_text(
best_model_fixed_alpha, feature_names=feature_names, show_weights=True
)
)
|--- lead_time <= 151.50
|   |--- any_special_requests <= 0.50
|   |   |--- market_segment_type_Online <= 0.50
|   |   |   |--- lead_time <= 90.50
|   |   |   |   |--- total_nights <= 5.50
|   |   |   |   |   |--- avg_price_per_room <= 201.50
|   |   |   |   |   |   |--- no_of_weekend_nights <= 0.50
|   |   |   |   |   |   |   |--- market_segment_type_Offline <= 0.50
|   |   |   |   |   |   |   |   |--- weights: [132.08, 536.80] class: Not_Canceled
|   |   |   |   |   |   |   |--- market_segment_type_Offline > 0.50
|   |   |   |   |   |   |   |   |--- weights: [0.00, 1199.59] class: Not_Canceled
|   |   |   |   |   |   |--- no_of_weekend_nights > 0.50
|   |   |   |   |   |   |   |--- lead_time <= 74.50
|   |   |   |   |   |   |   |   |--- arrival_date <= 27.50
|   |   |   |   |   |   |   |   |   |--- weights: [141.18, 854.40] class: Not_Canceled
|   |   |   |   |   |   |   |   |--- arrival_date > 27.50
|   |   |   |   |   |   |   |   |   |--- lead_time <= 1.50
|   |   |   |   |   |   |   |   |   |   |--- weights: [53.13, 8.95] class: Canceled
|   |   |   |   |   |   |   |   |   |--- lead_time > 1.50
|   |   |   |   |   |   |   |   |   |   |--- weights: [4.55, 58.15] class: Not_Canceled
|   |   |   |   |   |   |   |--- lead_time > 74.50
|   |   |   |   |   |   |   |   |--- weights: [88.05, 82.01] class: Canceled
|   |   |   |   |   |--- avg_price_per_room > 201.50
|   |   |   |   |   |   |--- weights: [25.81, 1.49] class: Canceled
|   |   |   |   |--- total_nights > 5.50
|   |   |   |   |   |--- avg_price_per_room <= 92.80
|   |   |   |   |   |   |--- weights: [12.14, 68.59] class: Not_Canceled
|   |   |   |   |   |--- avg_price_per_room > 92.80
|   |   |   |   |   |   |--- weights: [85.01, 17.15] class: Canceled
|   |   |   |--- lead_time > 90.50
|   |   |   |   |--- lead_time <= 117.50
|   |   |   |   |   |--- avg_price_per_room <= 93.58
|   |   |   |   |   |   |--- avg_price_per_room <= 75.07
|   |   |   |   |   |   |   |--- no_of_week_nights <= 2.50
|   |   |   |   |   |   |   |   |--- weights: [154.85, 25.35] class: Canceled
|   |   |   |   |   |   |   |--- no_of_week_nights > 2.50
|   |   |   |   |   |   |   |   |--- weights: [15.18, 60.39] class: Not_Canceled
|   |   |   |   |   |   |--- avg_price_per_room > 75.07
|   |   |   |   |   |   |   |--- weights: [57.69, 128.98] class: Not_Canceled
|   |   |   |   |   |--- avg_price_per_room > 93.58
|   |   |   |   |   |   |--- weights: [285.41, 82.76] class: Canceled
|   |   |   |   |--- lead_time > 117.50
|   |   |   |   |   |--- weights: [130.56, 315.37] class: Not_Canceled
|   |   |--- market_segment_type_Online > 0.50
|   |   |   |--- lead_time <= 13.50
|   |   |   |   |--- avg_price_per_room <= 99.44
|   |   |   |   |   |--- weights: [132.08, 456.28] class: Not_Canceled
|   |   |   |   |--- avg_price_per_room > 99.44
|   |   |   |   |   |--- lead_time <= 3.50
|   |   |   |   |   |   |--- weights: [85.01, 219.94] class: Not_Canceled
|   |   |   |   |   |--- lead_time > 3.50
|   |   |   |   |   |   |--- arrival_month <= 8.50
|   |   |   |   |   |   |   |--- weights: [232.27, 61.14] class: Canceled
|   |   |   |   |   |   |--- arrival_month > 8.50
|   |   |   |   |   |   |   |--- weights: [48.58, 71.57] class: Not_Canceled
|   |   |   |--- lead_time > 13.50
|   |   |   |   |--- required_car_parking_space <= 0.50
|   |   |   |   |   |--- avg_price_per_room <= 71.92
|   |   |   |   |   |   |--- weights: [159.40, 158.80] class: Canceled
|   |   |   |   |   |--- avg_price_per_room > 71.92
|   |   |   |   |   |   |--- arrival_year <= 2017.50
|   |   |   |   |   |   |   |--- lead_time <= 65.50
|   |   |   |   |   |   |   |   |--- weights: [21.25, 87.23] class: Not_Canceled
|   |   |   |   |   |   |   |--- lead_time > 65.50
|   |   |   |   |   |   |   |   |--- weights: [110.82, 20.13] class: Canceled
|   |   |   |   |   |   |--- arrival_year > 2017.50
|   |   |   |   |   |   |   |--- weights: [3411.21, 743.32] class: Canceled
|   |   |   |   |--- required_car_parking_space > 0.50
|   |   |   |   |   |--- weights: [1.52, 48.46] class: Not_Canceled
|   |--- any_special_requests > 0.50
|   |   |--- no_of_special_requests <= 1.50
|   |   |   |--- market_segment_type_Online <= 0.50
|   |   |   |   |--- weights: [40.99, 789.54] class: Not_Canceled
|   |   |   |--- market_segment_type_Online > 0.50
|   |   |   |   |--- lead_time <= 8.50
|   |   |   |   |   |--- weights: [107.79, 756.73] class: Not_Canceled
|   |   |   |   |--- lead_time > 8.50
|   |   |   |   |   |--- required_car_parking_space <= 0.50
|   |   |   |   |   |   |--- avg_price_per_room <= 118.55
|   |   |   |   |   |   |   |--- lead_time <= 61.50
|   |   |   |   |   |   |   |   |--- weights: [337.02, 972.94] class: Not_Canceled
|   |   |   |   |   |   |   |--- lead_time > 61.50
|   |   |   |   |   |   |   |   |--- arrival_year <= 2017.50
|   |   |   |   |   |   |   |   |   |--- weights: [112.34, 46.97] class: Canceled
|   |   |   |   |   |   |   |   |--- arrival_year > 2017.50
|   |   |   |   |   |   |   |   |   |--- weights: [346.13, 680.69] class: Not_Canceled
|   |   |   |   |   |   |--- avg_price_per_room > 118.55
|   |   |   |   |   |   |   |--- weights: [655.83, 811.91] class: Not_Canceled
|   |   |   |   |   |--- required_car_parking_space > 0.50
|   |   |   |   |   |   |--- weights: [1.52, 134.20] class: Not_Canceled
|   |   |--- no_of_special_requests > 1.50
|   |   |   |--- lead_time <= 90.50
|   |   |   |   |--- weights: [57.69, 1817.66] class: Not_Canceled
|   |   |   |--- lead_time > 90.50
|   |   |   |   |--- weights: [162.44, 358.61] class: Not_Canceled
|--- lead_time > 151.50
|   |--- avg_price_per_room <= 100.04
|   |   |--- any_special_requests <= 0.50
|   |   |   |--- no_of_adults <= 1.50
|   |   |   |   |--- market_segment_type_Online <= 0.50
|   |   |   |   |   |--- weights: [86.53, 261.69] class: Not_Canceled
|   |   |   |   |--- market_segment_type_Online > 0.50
|   |   |   |   |   |--- weights: [100.20, 9.69] class: Canceled
|   |   |   |--- no_of_adults > 1.50
|   |   |   |   |--- avg_price_per_room <= 82.47
|   |   |   |   |   |--- market_segment_type_Offline <= 0.50
|   |   |   |   |   |   |--- weights: [282.37, 2.98] class: Canceled
|   |   |   |   |   |--- market_segment_type_Offline > 0.50
|   |   |   |   |   |   |--- arrival_month <= 11.50
|   |   |   |   |   |   |   |--- lead_time <= 244.00
|   |   |   |   |   |   |   |   |--- total_nights <= 2.50
|   |   |   |   |   |   |   |   |   |--- weights: [57.69, 4.47] class: Canceled
|   |   |   |   |   |   |   |   |--- total_nights > 2.50
|   |   |   |   |   |   |   |   |   |--- weights: [27.33, 104.38] class: Not_Canceled
|   |   |   |   |   |   |   |--- lead_time > 244.00
|   |   |   |   |   |   |   |   |--- arrival_year <= 2017.50
|   |   |   |   |   |   |   |   |   |--- weights: [0.00, 25.35] class: Not_Canceled
|   |   |   |   |   |   |   |   |--- arrival_year > 2017.50
|   |   |   |   |   |   |   |   |   |--- weights: [300.59, 33.55] class: Canceled
|   |   |   |   |   |   |--- arrival_month > 11.50
|   |   |   |   |   |   |   |--- weights: [0.00, 46.22] class: Not_Canceled
|   |   |   |   |--- avg_price_per_room > 82.47
|   |   |   |   |   |--- weights: [1030.80, 29.08] class: Canceled
|   |   |--- any_special_requests > 0.50
|   |   |   |--- no_of_weekend_nights <= 0.50
|   |   |   |   |--- lead_time <= 180.50
|   |   |   |   |   |--- weights: [12.14, 44.73] class: Not_Canceled
|   |   |   |   |--- lead_time > 180.50
|   |   |   |   |   |--- weights: [212.54, 29.08] class: Canceled
|   |   |   |--- no_of_weekend_nights > 0.50
|   |   |   |   |--- weights: [153.33, 363.08] class: Not_Canceled
|   |--- avg_price_per_room > 100.04
|   |   |--- arrival_month <= 11.50
|   |   |   |--- no_of_special_requests <= 2.50
|   |   |   |   |--- weights: [3200.19, 0.00] class: Canceled
|   |   |   |--- no_of_special_requests > 2.50
|   |   |   |   |--- weights: [0.00, 23.11] class: Not_Canceled
|   |   |--- arrival_month > 11.50
|   |   |   |--- weights: [22.77, 42.50] class: Not_Canceled
importances = best_model_fixed_alpha.feature_importances_
indices = np.argsort(importances)
plt.figure(figsize=(8, 8))
plt.title("Feature Importances (Pruned Tree)")
plt.barh(range(len(indices)), importances[indices], color="violet", align="center")
plt.yticks(range(len(indices)), [feature_names[i] for i in indices])
plt.xlabel("Relative Importance")
plt.show()
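Alongside the bar chart, the same importances can be ranked numerically as a sorted Series. A minimal self-contained sketch (the toy data and target below are illustrative stand-ins; in the notebook the Series would be built directly from `best_model_fixed_alpha.feature_importances_` and `feature_names`):

```python
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

# Toy stand-in data; in the notebook, use the fitted pruned tree instead.
rng = np.random.default_rng(1)
X = pd.DataFrame({
    "lead_time": rng.integers(0, 400, 500),
    "avg_price_per_room": rng.normal(100, 35, 500),
    "no_of_special_requests": rng.integers(0, 4, 500),
})
y = (X["lead_time"] > 150).astype(int)  # synthetic target driven only by lead_time

model = DecisionTreeClassifier(random_state=1).fit(X, y)

# Rank features by importance, largest first
imp = (pd.Series(model.feature_importances_, index=X.columns)
         .sort_values(ascending=False))
print(imp)
```

Because the synthetic target depends only on `lead_time`, that feature carries all of the importance here; on the real data the same two lines give a sorted importance table for the pruned tree.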
Actionable Insights and Recommendations¶
- What profitable policies for cancellations and refunds can the hotel adopt?
- What other recommendations would you suggest to the hotel?
1. Lead Time is the Most Significant Factor¶
- Insight: The analysis consistently shows that lead time is the most influential factor in predicting booking cancellations. The longer the lead time, the higher the probability of cancellation.
- Recommendation: The hotel should consider implementing stricter cancellation policies for bookings made far in advance. For example, bookings with a lead time greater than a certain threshold (e.g., 60 days) could require a non-refundable deposit or a stricter cancellation policy to reduce the risk of last-minute cancellations.
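A concrete threshold can be chosen empirically by tabulating the cancellation rate per lead-time bucket. A sketch on synthetic data (in the notebook, the same `groupby` would run on `df_eda`'s `lead_time` and `booking_status` columns; the bucket edges here are illustrative):

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for df_eda; the real columns carry the same names.
rng = np.random.default_rng(0)
lead_time = rng.integers(0, 400, 2000)
# In this toy example, cancellation probability rises with lead time.
canceled = rng.random(2000) < np.clip(lead_time / 400, 0.05, 0.9)
df = pd.DataFrame({
    "lead_time": lead_time,
    "booking_status": np.where(canceled, "Canceled", "Not_Canceled"),
})

# Cancellation rate per lead-time bucket
buckets = pd.cut(df["lead_time"], bins=[0, 30, 60, 90, 150, 400], include_lowest=True)
rate = (df.assign(is_canceled=df["booking_status"].eq("Canceled"))
          .groupby(buckets, observed=True)["is_canceled"].mean())
print(rate)
```

The bucket where the rate jumps is a natural candidate for the deposit cutoff.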
2. Special Requests as a Predictor¶
- Insight: Special requests were identified as significant predictors of cancellations. Customers who make special requests may have specific expectations and may be more likely to cancel if these expectations are not met.
- Recommendation: The hotel could implement a follow-up process for guests with special requests to ensure their needs are met, potentially reducing the likelihood of cancellations. Additionally, offering incentives for guests with special requests to finalize their booking (e.g., discounts on additional services) could help reduce cancellations.
3. Market Segment Type¶
- Insight: The market segment type, particularly online bookings, has a substantial impact on the likelihood of cancellations.
- Recommendation: The hotel could create tailored cancellation policies based on the booking channel. For example, stricter policies could be applied to online bookings where the cancellation rate is higher, while more flexible policies could be offered for bookings made directly through the hotel's website or via phone.
4. Price Sensitivity¶
- Insight: Rooms with lower prices tend to have a higher likelihood of being canceled, indicating price-sensitive customers.
- Recommendation: Implement a tiered pricing structure where non-refundable rates are offered at a lower price point, while fully refundable rates are priced higher. This approach can cater to both price-sensitive customers and those who prefer flexibility, balancing the risk of cancellations with revenue protection.
5. Profitable Cancellation and Refund Policies¶
- Non-refundable Rates: Introduce non-refundable rates at a lower price point to attract customers who are certain about their travel plans. This could be coupled with a small incentive, such as a discount on future bookings.
- Flexible Rates with Conditions: Offer flexible cancellation rates with conditions such as a minimum lead time for cancellations to qualify for a full refund. For instance, bookings canceled within 14 days of the check-in date might receive only a partial refund.
- Dynamic Cancellation Policies: Implement dynamic cancellation policies that adjust based on lead time and the market segment. For example, bookings made via online platforms with a lead time of over 30 days might have stricter cancellation terms compared to direct bookings.
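As an illustration, such a dynamic policy can be expressed as a simple rule function (a hypothetical sketch; the thresholds and refund tiers are placeholders for the revenue team to set, not outputs of the model):

```python
def cancellation_terms(lead_time_days: int, segment: str) -> str:
    """Hypothetical refund terms based on lead time and booking channel."""
    if segment == "Online" and lead_time_days > 30:
        return "non_refundable_deposit"  # stricter terms for long-lead online bookings
    if lead_time_days > 60:
        return "partial_refund"          # long lead time in any other channel
    return "full_refund"                 # short-lead / direct bookings stay flexible

print(cancellation_terms(45, "Online"))  # → non_refundable_deposit
```

Encoding the policy as a function makes it easy to audit and to tune the thresholds against the model's predicted cancellation probabilities.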
6. Other Recommendations¶
- Enhanced Customer Engagement: For bookings with a higher likelihood of cancellation, such as those made with a long lead time, consider implementing engagement strategies like pre-stay emails or offers to encourage guests to maintain their reservations.
- Data-Driven Decision Making: Continuously monitor and analyze booking patterns and cancellation rates. Use this data to refine and optimize cancellation policies regularly. Consider deploying machine learning models that can predict cancellations more accurately and provide personalized recommendations to customers at the time of booking.
- Offer Insurance Options: Provide guests with the option to purchase travel insurance that covers cancellations. This could be an additional revenue stream for the hotel while also providing peace of mind to customers.
These recommendations aim to reduce the financial impact of cancellations on the hotel while maintaining customer satisfaction and loyalty. By leveraging the insights from the decision tree model and other analyses, the hotel can implement strategies that protect revenue and enhance the overall guest experience.
3.1 Univariate Analysis¶
Univariate analysis involves:
- Plotting bar charts for categorical variables to visualize their frequency distributions.
- Plotting histograms and box plots for numerical variables to understand their central tendency and dispersion.
3.1.1 Analysis of Numerical Features¶
# Making a list of all numerical variables ('int64', 'float64', 'complex')
num_cols = df_eda.select_dtypes(include=["int64", "float64", "complex"]).columns

# Iterate through each numerical column and plot the histogram and boxplot combined
for column in num_cols:
    print(f"Distribution of '{column}'")
    print(df_eda[column].describe())
    histogram_boxplot(df_eda, column, bins=50, kde=True)
    print("-" * 100)
Summary statistics for the numerical features (count = 36275 in every case):

| Variable | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|
| no_of_adults | 1.84496 | 0.51871 | 0 | 2 | 2 | 2 | 4 |
| no_of_children | 0.10528 | 0.40265 | 0 | 0 | 0 | 0 | 10 |
| no_of_weekend_nights | 0.81072 | 0.87064 | 0 | 0 | 1 | 2 | 7 |
| no_of_week_nights | 2.20430 | 1.41090 | 0 | 1 | 2 | 3 | 17 |
| required_car_parking_space | 0.03099 | 0.17328 | 0 | 0 | 0 | 0 | 1 |
| lead_time | 85.23256 | 85.93082 | 0 | 17 | 57 | 126 | 443 |
| arrival_year | 2017.82043 | 0.38384 | 2017 | 2018 | 2018 | 2018 | 2018 |
| arrival_month | 7.42365 | 3.06989 | 1 | 5 | 8 | 10 | 12 |
| arrival_date | 15.59700 | 8.74045 | 1 | 8 | 16 | 23 | 31 |
| repeated_guest | 0.02564 | 0.15805 | 0 | 0 | 0 | 0 | 1 |
| no_of_previous_cancellations | 0.02335 | 0.36833 | 0 | 0 | 0 | 0 | 13 |
| no_of_previous_bookings_not_canceled | 0.15341 | 1.75417 | 0 | 0 | 0 | 0 | 58 |
| avg_price_per_room | 103.42354 | 35.08942 | 0.00 | 80.30 | 99.45 | 120.00 | 540.00 |
| no_of_special_requests | 0.61966 | 0.78624 | 0 | 0 | 0 | 1 | 5 |
3.1.2 Analysis of Categorical Features¶
# Making a list of all categorical variables ('object' or 'category')
cat_cols = df_eda.select_dtypes(include=["object", "category"]).columns

# Iterate through each categorical column and plot the distribution
for column in cat_cols:
    print(f"Distribution of '{column}'")
    print(df_eda[column].describe())
    labeled_barplot(df_eda, column, perc=True)
    print("-" * 100)
Summary of the categorical features (count = 36275 in every case):

| Variable | unique | top | freq |
|---|---|---|---|
| type_of_meal_plan | 4 | Meal Plan 1 | 27835 |
| room_type_reserved | 7 | Room_Type 1 | 28130 |
| market_segment_type | 5 | Online | 23214 |
| booking_status | 2 | Not_Canceled | 24390 |
3.2 Multivariate Analysis¶
# Correlation heatmap of all numerical features
cols_list = df_eda.select_dtypes(include=np.number).columns.tolist()

plt.figure(figsize=(15, 7))
sns.heatmap(
    df_eda[cols_list].corr(), annot=True, vmin=-1, vmax=1, fmt=".2f", cmap="Spectral"
)
plt.show()